Streaming Processing for Big Data - なぜか数学者にはワイン好きが多い

ビッグデータが流行っているので，大量データをストリーミングにsemi-realtimeに処理できるフレームワークを探しました．

条件は，

処理の取りこぼしが無いこと
できるだけシンプルであること
各種言語へのバインディングがあること

などです．

フレームワークとしては，Dempsy，Storm，Esper，Streambase，HStreaming，Yahoo S4などがありますが，順次調べていきます．

Dempsy

比較的新しい，Dempsy.
http://dempsy.github.com/Dempsy/

What is Dempsy?
In a nutshell, Dempsy is a framework that provides for the easy implementation Stream-based, Real-time, BigData applications.
Dempsy is the Nokia's "Distributed Elastic Message Processing System."
Dempsy is Distributed. That is to say a dempsy application can run on multiple JVMs on multiple physical machines.
Dempsy is Elastic. That is, it is relatively simple to scale an application to more (or fewer) nodes. This does not require code or configuration changes but allows the dynamic insertion and removal of processing nodes.
Dempsy is Message Processing. Dempsy fundamentally works by message passing. It moves messages between Message processors, which act on the messages to perform simple atomic operations such as enrichment, transformation, or other processing. Generally an application is intended to be broken down into more smaller simpler processors rather than fewer large complex processors.
Dempsy is a Framework. It is not an application container like a J2EE container, nor a simple library. Instead, like the Spring Framework it is a collection of patterns, the libraries to enable those patterns, and the interfaces one must implement to use those libraries to implement the patterns.

Dempsyとは？
簡単に言うと，Dempsyはストリーム型でリアルタイムのBigDataアプリケーションの容易な実装を提供するフレームワークです．DempsyはNokiaの「分散型柔軟型メッセージ処理システム」です．
・Dempsyは分散型です．つまり，dempsyアプリケーションは複数の物理マシン上の複数のJVM上で実行することができます．
・Dempsyは柔軟型です．つまり，アプリケーションをスケールさせるためにノードを増減させることが比較的シンプルに可能です．コードや設定の変更無しに，動的に処理ノードの追加や削除が可能です．
・Dempsyはメッセージ処理です．つまり，Dempsyは基本的にメッセージパッシングによって動きます．一般に，アプリケーションは少ない巨大で複雑な処理ではなく小さいシンプルな処理に分解されるよう求められます．
・Dempsyはフレームワークです．J2EEコンテナのようなアプリケーションコンテナや単純なライブラリではありません．パターンの集合であるSpring Frameworkのように，パターンを実現するライブラリとインターフェースは，パターンを実装するためにライブラリを使って実装する必要があります．

What Problem is Dempsy solving?
Dempsy is not designed to be a general purpose framework, but is intended to solve a certain class of problems while encouraging the use of the best software development practices.
Dempsy is meant to solve the problem of processing large amounts of "near real time" stream data with the lowest lag possible; problems where latency is more important that "guaranteed delivery." This class of problems includes use cases such as:
Real time monitoring of large distributed systems
Processing complete rich streams of social networking data
Real time analytics on log information generated from widely distributed systems
Statistical analytics on real-time vehicle traffic information on a global basis
It is meant to provide developers with a tool that allows them to solve these problems in a simple straightforward manner by allowing them to concentrate on the analytics themselves rather than the infrastructure. Dempsy heavily emphasizes "separation of concerns" through "dependency injection" and out of the box supports both Spring and Guice. It does all of this by supporting what can be (almost) described as a "distributed actor model."
In short Dempsy is a framework to enable decomposing a large class of message processing applications into flows of messages to relatively simple processing units implemented as POJOs

Dempsyはどのような問題を解決していますか？
Dempsyは汎用的なフレームワークとして設計されたわけではありません．しかしながら優れたソフトウェア開発業務を推奨する意味で，ある程度の範囲の問題を解決できるよう設計されています．
Dempsyは大量の「ほぼリアルタイムの」ストリームデータを，最小限の遅延可能性で処理するような問題を解決するよう，設計されています．特に，遅延が「保証された結果出力」よりも重要な問題を想定しています．このような範疇の問題には，次のようなものがあります．

巨大な分散システムのリアルタイム監視

ソーシャルネットワーキングデータのリッチストリームの処理完了

広範囲に分散化されたシステムから生成されたログ情報のリアルタイム解析

大域的なリアルタイム車両通行情報の統計解析

この意味するところは，インフラよりも解析そのものに開発者が集中できるよう，シンプルで直接的な手法で問題を解決できるツールを提供しているということです．
Dempsyは「依存性注入」(依存性の注入 - Wikipedia)を通した「関心の分離」(関心の分離 - Wikipedia)と，SpringとGuiceの枠を超えたサポートを強く支援します．これは通常「分散アクタモデル」と呼ばれるをサポートすることにより実現されています．
また，一言で言うと，Denpsyはめメッセージパッシングアプリケーションの広範囲な領域を比較的シンプルなPOJOs(Plain Old Java Object - Wikipedia)で実装される処理単位のメッセージフローに分関することを実現するフレームワークです．

Storm

そして我らが本命，Storm.
Home · nathanmarz/storm Wiki · GitHub
http://storm-project.net/documentation.html

Rationale
The past decade has seen a revolution in data processing. MapReduce, Hadoop, and related technologies have made it possible to store and process data at scales previously unthinkable. Unfortunately, these data processing technologies are not realtime systems, nor are they meant to be. There's no hack that will turn Hadoop into a realtime system; realtime data processing has a fundamentally different set of requirements than batch processing.
However, realtime data processing at massive scale is becoming more and more of a requirement for businesses. The lack of a "Hadoop of realtime" has become the biggest hole in the data processing ecosystem.
Storm fills that hole.
Before Storm, you would typically have to manually build a network of queues and workers to do realtime processing. Workers would process messages off a queue, update databases, and send new messages to other queues for further processing. Unfortunately, this approach has serious limitations:
Tedious: You spend most of your development time configuring where to send messages, deploying workers, and deploying intermediate queues. The realtime processing logic that you care about corresponds to a relatively small percentage of your codebase.
Brittle: There's little fault-tolerance. You're responsible for keeping each worker and queue up.
Painful to scale: When the message throughput get too high for a single worker or queue, you need to partition how the data is spread around. You need to reconfigure the other workers to know the new locations to send messages. This introduces moving parts and new pieces that can fail.
Although the queues and workers paradigm breaks down for large numbers of messages, message processing is clearly the fundamental paradigm for realtime computation. The question is: how do you do it in a way that doesn't lose data, scales to huge volumes of messages, and is dead-simple to use and operate?
Storm satisfies these goals.

理論的根拠
ここ10年でデータ処理革命がありました．MapReduce，Hadoop，そしてそれらに関連する技術は以前には考えられなかったようなスケールでのデータストアと処理を可能としました．しかしながら，これらはリアルタイムシステムではなく，そのように設計もされていませんでした．Hadoopをリアルタイムシステムに適用する優れた技術はありません．リアルタイムデータ処理はバッチ処理とは基本的に必要なものが違うのです．
しかし，莫大なスケールのリアルタイムデータ処理はますますビジネスに必要とされてきています．「リアルタイムのHadoop」の欠如は，データ処理界隈で最も大きな落とし穴となっています．
Stormは，この穴を埋めます．
Storm以前では，リアルタイム処理のためにキューとワーカのネットワークを自身で構築する必要があるでしょう．ワーカはキューのメッセージを処理し，データベースを更新し，次の処理のために他のキューにメッセージを送ることになるでしょう．残念ながら，このアプローチは深刻な限界があります：

つまらない：開発時間の多くをメッセージ送信，ワーカのデプロイ，待ち行列のデプロイに裂かなければなりません．本来気にかけるべきリアルタイム処理ロジックは，プログラミングに対して相対的にパーセンテージが小さくなります．

もろい：フォールト・トレランス性がありません．ワーカとキューが活動し続けさせることに責任を持つ必要があります．

スケールが大変：シングルのワーカやキューに対するメッセージのスループットが非常に高くなった場合，データがうまくばらまかれるように分割しなければなりません．そのためには一部分の移動や新規の部分を導入することになりますが，それが上手く働くとは限りません．

キューとワーカの概念は大量のメッセージに対しては失敗するにも関わらず，メッセージ処理は明らかにリアルタイム処理のための基本概念です．すると問題は，「データ損失無く，大量メッセージに対しスケールし，非常にシンプルに利用かつ操作するにはどうすれば良いのでしょうか？」ということです．
Stormなら，これらの目的を達することができます．

Why Storm is important
Storm exposes a set of primitives for doing realtime computation. Like how MapReduce greatly eases the writing of parallel batch processing, Storm's primitives greatly ease the writing of parallel realtime computation.
The key properties of Storm are:
Extremely broad set of use cases: Storm can be used for processing messages and updating databases (stream processing), doing a continuous query on data streams and streaming the results into clients (continuous computation), parallelizing an intense query like a search query on the fly (distributed RPC), and more. Storm's small set of primitives satisfy a stunning number of use cases.
Scalable: Storm scales to massive numbers of messages per second. To scale a topology, all you have to do is add machines and increase the parallelism settings of the topology. As an example of Storm's scale, one of Storm's initial applications processed 1,000,000 messages per second on a 10 node cluster, including hundreds of database calls per second as part of the topology. Storm's usage of Zookeeper for cluster coordination makes it scale to much larger cluster sizes.
Guarantees no data loss: A realtime system must have strong guarantees about data being successfully processed. A system that drops data has a very limited set of use cases. Storm guarantees that every message will be processed, and this is in direct contrast with other systems like S4.
Extremely robust: Unlike systems like Hadoop, which are notorious for being difficult to manage, Storm clusters just work. It is an explicit goal of the Storm project to make the user experience of managing Storm clusters as painless as possible.
Fault-tolerant: If there are faults during execution of your computation, Storm will reassign tasks as necessary. Storm makes sure that a computation can run forever (or until you kill the computation).
Programming language agnostic: Robust and scalable realtime processing shouldn't be limited to a single platform. Storm topologies and processing components can be defined in any language, making Storm accessible to nearly anyone.

なぜStormが重要なのでしょうか？
Stormはリアルタイム処理を行うためのプリミティブ集合を明らかにします．MapReduceが並列バッチ処理を驚くほど簡単に書けるようにするように，Stormのプリミティブは並列リアルタイム処理の記述を驚くほど簡単にします．
Stormの特徴的な性質は以下のようなものです：

利用範囲の驚異的な広さ：Stormはメッセージの処理とデータベースの更新(ストリーミング処理)に使うことができます．その際，データストリーミングの連続的なクエリを発行し，結果をクライアントにストリーミングし（連続的処理），サーチクエリのような強力なクエリのオンザフライ(分散RPC)の並列化等々を行います．Stormの小さなプリミティブ集合は，驚くほど多くの利用場面に適用できます．

スケーラブル：Stormは毎秒の莫大なメッセージ数までスケールします．トポロジーをスケールするためにやらなければならないことは，マシンを増設してトポロジーの並列設定を増加させることだけです．Stormのスケールの例として，ある10ノードクラスタnのStormの初期アプリケーションが何百もの秒間のデータベースアクセスと共に，秒間1,000,000メッセージを処理していたとしましょう．Stormのクラスタ管理のためのZookeeperの利用は，クラスタのサイズを大きくスケールさせます．

データ損失無しの保証：リアルタイムシステムはデータ処理が成功することを強く保証しなければなりません．システムがデータを損失することは，非常に限られた場合であるべきです．Stormは各メッセージが処理されることを保証します．そしてそれが，他のS4のようなシステムとの違いとなります．

極めてロバスト：運用が難しくて悪名高いHadoopのようなシステムと異なり，Stormクラスタはきちんと動きます．ユーザのStormクラスタ運用経験が苦痛でなくなるようにすることがStormプロジェクトのかかげるゴールです．

耐障害性：計算処理実行中に障害が発生したら，Stormは可能な限りタスクを再割当てし再実行しようとします．Stormは，計算処理がkillされない限り，必ず実行されることを保証します．

プログラミング言語非依存：ロバストでスケーラブルなリアルタイム処理は，一つとプラットフォームに制限されるべきではありません．Stormトポロジーとコンポーネント処理は，どんな言語でも定義可能で，ほとんどどれでもアクセス可能です．

続きは，また，いずれ．．．