History of Wang2013 (No.24) - PukiWiki

12/19†

This Term†

Next Term†

12/13†

This Term†

Next Term†

12/5†

This Term†

Next Term†

11/26†

This Term†

Next Term†

11/14†

This Term†

Next Term†

10/31†

This Term†

Next Term†

10/22†

This Term†

think about event-based processing

Next Term†

the same

10/17†

This Term†

read books
- C++ Concurrency in Action
- Python source code analysis
meeting with Kitagawa-sensei

Next Term†

read papers about other stream processing engines
think about event-based processing and parallel processing

10/03†

This Term†

improve the performance of JsSpinner
- JsSpinner can now process 400 thousand tuples per second through all operators
implement the groupby_aggregation operator
- supports the sum, avg, and count operations
read paper
- Discretized Streams: An Efficient and Fault-Tolerant Model for Stream Processing on Large Clusters
- One reason it can achieve fault tolerance is that it is built on HDFS; the distributed file system provides the distribution and replication functionality. It also uses provenance to recompute lost data, which reduces the time needed to recover from a failure.
think about map-reduce operator
- map-reduce may work well with parallel processing, but our system has just one thread. How to parallelize the system may be related to how the map-reduce operator is implemented.
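
A minimal sketch of how a group-by aggregation with sum, avg, and count could keep its state, written as an illustration only. The Tuple, Aggregate, and GroupByAggregation names are assumptions for this example and are not taken from the JsSpinner source; the real operator works on JSON/BSON elements rather than this simplified key/value pair.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <string>

// Hypothetical tuple type: a grouping key plus one numeric value.
struct Tuple {
    std::string key;
    double value;
};

// Running aggregate state for one group.
struct Aggregate {
    double sum = 0.0;
    long count = 0;
    // avg is derived from sum and count; an empty group reports 0.
    double avg() const { return count == 0 ? 0.0 : sum / count; }
};

// Hypothetical operator: consumes tuples one at a time and keeps
// per-key state, so it fits a single-threaded engine loop.
class GroupByAggregation {
public:
    void process(const Tuple& t) {
        Aggregate& agg = groups_[t.key];  // created with zeros on first use
        agg.sum += t.value;
        agg.count += 1;
    }

    // Throws std::out_of_range if the key was never seen.
    const Aggregate& result(const std::string& key) const {
        return groups_.at(key);
    }

    std::size_t groupCount() const { return groups_.size(); }

private:
    std::map<std::string, Aggregate> groups_;
};
```

Because each group only stores a sum and a count, the state stays constant in size per key, and the per-key states could later be merged if the operator is split across threads.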

Next Term†

prepare for the KDE seminar (10/15)

07/25†

This Term†

Next Term†

07/12†

This Term†

our client library has a new short name: JsSpinlet
write the necessary source code comments
implement the RSS wrapper
- the RSS wrapper may contain several RSS feed URLs; applications can register queries to select the URLs that match certain keywords
- for example, we can be notified when an animation we are interested in is published, then open the URL and watch it
survey of information sources
- the best information source has a JSON API providing a public timeline
- Twitter has a public timeline: Twitter is public by default, so Tweets can be seen by everybody
- Facebook doesn't have a public timeline: Facebook is private by default, so you can only see your friends' posts
- GitHub has a public timeline
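
The keyword selection described for the RSS wrapper could look roughly like the sketch below. It is only an illustration: RssItem, toLower, and selectUrls are hypothetical names, matching is done on the item title with a case-insensitive ASCII substring test, and the real wrapper may match on other fields.

```cpp
#include <algorithm>
#include <cassert>
#include <cctype>
#include <string>
#include <vector>

// Hypothetical RSS item: only the fields needed for matching.
struct RssItem {
    std::string title;
    std::string url;
};

// Lower-case a copy of the string (ASCII only).
inline std::string toLower(std::string s) {
    std::transform(s.begin(), s.end(), s.begin(), [](unsigned char c) {
        return static_cast<char>(std::tolower(c));
    });
    return s;
}

// Return the URLs of all items whose title contains at least one keyword.
// Note: an empty keyword matches every item.
inline std::vector<std::string> selectUrls(
        const std::vector<RssItem>& items,
        const std::vector<std::string>& keywords) {
    std::vector<std::string> urls;
    for (const RssItem& item : items) {
        const std::string title = toLower(item.title);
        for (const std::string& kw : keywords) {
            if (title.find(toLower(kw)) != std::string::npos) {
                urls.push_back(item.url);
                break;  // one match is enough; avoid duplicate URLs
            }
        }
    }
    return urls;
}
```

An application interested in new animation episodes would call selectUrls with its keyword list each time the feed is refreshed and open the returned URLs.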

Next Term†

finish the implementation of the RSS wrapper and think about map/reduce operators

07/04†

This Term†

implement a simple demo
- get each Tweet's content and its favorite count
read a paper
- Spark: Cluster Computing with Working Sets

Next Term†

modify the demo and think about implementing the map and reduce operators

06/28†

This Term†

Our system now has its own name.
- server: JStreamSpinner
- client library: JSpinlet
implement how JStreamSpinner interacts with wrappers (of information sources)
- JStreamSpinner contains a wrapper folder; each file in this folder stands for one wrapper (information source). When JStreamSpinner starts, it loads these wrapper files.
implement the JSpinlet API
- bool connectServer(std::string serverIp, std::string serverPort);
- bool registerQuery(std::string query, void (*callBackFunction)(Element& element));
- void execute(void);
- when registering a query, the user provides the query string and a callback function. Whenever new data arrives, this function is called.
- JSpinlet also contains code to handle BSON data and I/O, but the I/O is hidden from the user. In fact, JStreamSpinner listens on a port to accept commands, and JSpinlet sends the register-query command to JStreamSpinner. JSpinlet uses the libevent library to receive data from JStreamSpinner.
implement the Twitter wrapper
- we want to get Tweets continuously through the Twitter API.
- the old Twitter API stopped working last week, so we had to write code for the new Twitter API.
- it has a GET statuses/sample API, which returns a small random sample of all public statuses. It is very convenient because it is a streaming API: we authorize with OAuth and send one request through this API, which opens a long-lived connection over which Twitter sends data to us continuously, so the limit on how many times we can call the Twitter API per minute causes no problems.
- now we can get 140 Twitter statuses per second. Each status is quite complex, with 35 attributes including the language, the favorite count, the user who posted the Tweet, and the Tweet itself. The longest status we have received contains 8000 JSON characters.
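
To show how the three JSpinlet calls fit together from the user's side, here is an in-memory mock with the same method signatures. It is a sketch under stated assumptions, not the JSpinlet implementation: MockJSpinlet does no networking, BSON, or libevent work, the Element struct is a placeholder, and feed() is a hypothetical helper that plays the role of the server pushing results.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical stand-in for the real Element (which carries BSON data).
struct Element {
    std::string text;
};

// In-memory mock that mirrors the three method signatures from the log.
class MockJSpinlet {
public:
    // The mock only checks that both arguments are non-empty.
    bool connectServer(std::string serverIp, std::string serverPort) {
        connected_ = !serverIp.empty() && !serverPort.empty();
        return connected_;
    }

    // Registration fails unless connected and given a usable callback.
    bool registerQuery(std::string query,
                       void (*callBackFunction)(Element& element)) {
        if (!connected_ || query.empty() || callBackFunction == nullptr) {
            return false;
        }
        callbacks_.push_back(callBackFunction);
        return true;
    }

    // Hypothetical helper, not part of the logged API: queue one result
    // as if the server had sent it.
    void feed(const Element& e) { pending_.push_back(e); }

    // Deliver every pending element to every registered callback.
    // (The mock returns when the queue is drained instead of looping.)
    void execute(void) {
        for (Element& e : pending_) {
            for (auto cb : callbacks_) cb(e);
        }
        pending_.clear();
    }

private:
    bool connected_ = false;
    std::vector<void (*)(Element&)> callbacks_;
    std::vector<Element> pending_;
};

// Example user callback: collects the text of each received element.
inline std::vector<std::string>& received() {
    static std::vector<std::string> r;
    return r;
}
inline void onElement(Element& element) { received().push_back(element.text); }
```

A user program would call connectServer, then registerQuery with a query string and onElement, then execute; every element that arrives afterwards ends up in received().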

Next Term†

look into what we get from the Twitter API

06/21†

This Term†

Question about design.
How does the user run the stream processing engine?
Approach 1
- run the stream processing engine on one server.
- It listens on one port to accept commands.
- The user is a different process; it sends commands (register query, register schema) to the server over a socket.
- The server supports different kinds of wrappers for different kinds of information sources.
Approach 2
- the user and the engine may be the same process.
- The engine classes are compiled into binary libraries.
- The engine provides a set of classes the user can use.
- When the user runs the system, the user provides the wrappers that implement the information sources.
- Then only those classes need to be compiled; there is no need to recompile all of the engine's classes.

I prefer approach 2 because I think how to get input data and how to handle output data should be specified by the user. The user is then able to get better performance.
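
Approach 2 could be sketched as below: the engine library exposes an abstract wrapper interface, and the user compiles only their own wrapper and output handler against it. All names here (Wrapper, runEngine, VectorWrapper) are hypothetical and chosen for illustration; the actual engine classes are not described in this log.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <string>
#include <utility>
#include <vector>

// Hypothetical placeholder for the engine's element type.
struct Element {
    std::string text;
};

// Hypothetical interface the user implements for one information source.
class Wrapper {
public:
    virtual ~Wrapper() = default;
    // Write the next input element into `out`; return false when exhausted.
    virtual bool next(Element& out) = 0;
};

// Hypothetical engine entry point: pulls elements from the wrapper and
// hands each one to the user-supplied sink. Returns the number processed.
inline std::size_t runEngine(
        Wrapper& source,
        const std::function<void(const Element&)>& sink) {
    std::size_t n = 0;
    Element e;
    while (source.next(e)) {
        sink(e);
        ++n;
    }
    return n;
}

// Example user-side wrapper over an in-memory list of strings.
class VectorWrapper : public Wrapper {
public:
    explicit VectorWrapper(std::vector<std::string> items)
        : items_(std::move(items)) {}

    bool next(Element& out) override {
        if (pos_ >= items_.size()) return false;
        out.text = items_[pos_++];
        return true;
    }

private:
    std::vector<std::string> items_;
    std::size_t pos_ = 0;
};
```

With this shape, both the input side (the Wrapper subclass) and the output side (the sink) live in user code, which is the property approach 2 is preferred for.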

Next Term†

06/12†

This Term†

Next Term†

06/07†

This Term†

try to implement a client-server architecture for the system
- then a client can register queries and get query results from the server
- so we should think about communication between different PCs
communication mechanism
- the easiest way is to use sockets directly; the best way to use sockets is I/O multiplexing or asynchronous I/O
- this is difficult to implement because we have to deal with blocking, buffering, and many other things. The best way is to use the mechanism provided by each platform, because the platform can support it with low-level system calls and signals.
  - Linux: epoll
  - Windows: IOCP
- it is also not good to use epoll/IOCP/kqueue directly because they are complex and not cross-platform, so there are libraries that encapsulate them
  - libevent
  - libev
  - boost::asio
- it is also not ideal to use libevent/libev/boost::asio directly, because then we deal with binary data directly and the code is not object-oriented. It is better to use RPC (remote procedure call).
  - RCF (boost::asio)
  - eventrpc (libevent)
  - evproto (libevent)
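
As a small illustration of I/O multiplexing, the sketch below waits on several descriptors at once and reports which ones are readable. It uses the POSIX poll() call as a simpler stand-in for epoll, so it is an assumption-laden example for POSIX systems only and not the mechanism the system finally uses; waitReadable is a hypothetical helper name.

```cpp
#include <cassert>
#include <poll.h>
#include <unistd.h>
#include <vector>

// Wait up to timeoutMs milliseconds for any of the descriptors to become
// readable and return the ones that are. Returns an empty vector on
// timeout or error.
inline std::vector<int> waitReadable(const std::vector<int>& fds,
                                     int timeoutMs) {
    std::vector<pollfd> pfds;
    for (int fd : fds) {
        pollfd p;
        p.fd = fd;
        p.events = POLLIN;  // we only care about "data available to read"
        p.revents = 0;
        pfds.push_back(p);
    }

    std::vector<int> ready;
    int n = poll(pfds.data(), static_cast<nfds_t>(pfds.size()), timeoutMs);
    if (n <= 0) return ready;  // 0 = timeout, -1 = error

    for (const pollfd& p : pfds) {
        if (p.revents & POLLIN) ready.push_back(p.fd);
    }
    return ready;
}
```

A single thread can call waitReadable in a loop and read only from the returned descriptors, which avoids blocking on one idle connection while another has data. Libraries such as libevent wrap the same idea behind a callback interface.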

Next Term†

go on implementing the system

05/24†

This Term†

finish the binary representation of JSON
- now each JSON document has a binary representation rather than a character stream
finish the memory manager
- memory allocation unit: page (each page is divided into fixed-size chunks, and one record is stored in one chunk)
finish the queue manager
- void push(Element& element);
- void pop(void);
- void front(Element& element);
- bool isEmpty(void);
- bool isFull(void);
implementing the synopsis
- window synopsis
  - void insertElement(Element& element);
  - void deleteOldestElement(void);
  - void getOldestElement(Element& element);
- lineage synopsis
  - void insertLineage(Lineage& lineage, Element outputElement);
  - void getAndDeleteElement(Lineage lineage, Element& outputElement);
- relation synopsis
  - void insertElement(Element& element);
  - void deleteElement(Element& element);
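
The queue manager interface listed above could be backed by a fixed-capacity ring buffer, sketched below. This is one possible layout under assumptions, not the actual implementation: the Element struct is a placeholder, the capacity is a constructor argument that must be greater than zero, and the const qualifiers are additions that the logged signatures do not show.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical placeholder for the engine's element type.
struct Element {
    std::string text;
};

// Fixed-capacity FIFO queue using the method names from the log.
class Queue {
public:
    // capacity must be greater than zero.
    explicit Queue(std::size_t capacity)
        : slots_(capacity), head_(0), size_(0) {}

    bool isEmpty(void) const { return size_ == 0; }
    bool isFull(void) const { return size_ == slots_.size(); }

    // Append a copy of the element at the tail. Precondition: !isFull().
    void push(Element& element) {
        assert(!isFull());
        slots_[(head_ + size_) % slots_.size()] = element;
        ++size_;
    }

    // Remove the oldest element. Precondition: !isEmpty().
    void pop(void) {
        assert(!isEmpty());
        head_ = (head_ + 1) % slots_.size();
        --size_;
    }

    // Copy the oldest element into `element`. Precondition: !isEmpty().
    void front(Element& element) const {
        assert(!isEmpty());
        element = slots_[head_];
    }

private:
    std::vector<Element> slots_;  // storage allocated once, up front
    std::size_t head_;            // index of the oldest element
    std::size_t size_;            // number of stored elements
};
```

The window synopsis operations (insertElement, deleteOldestElement, getOldestElement) follow the same first-in, first-out pattern, so a structure like this could also back a window of bounded size.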

Next Term†

finish the implementation of the synopses and operators

05/17†

This Term†

study STREAM
design the system
prepare for the integration seminar

Next Term†

implementation

4/26†

This Term†

read the source code of STREAM
make our system run on Linux
study Emacs, gdb, and make
design our system

Next Term†

design and implement our system