This is a 400-word condensed introduction to Kafka.
Confluent Kafka is distributed storage and computing software that provides high throughput and low latency.
Storage
The basic unit of Kafka storage is a partitioned log called a topic, replicated across machines called brokers.
Topics are divided into partitions, and events/messages with the same key end up on the same partition, in order. For example, all events for a single stock end up on the same partition.
Partitions are replicated for durability. Messages can flow directly from the producer through the broker's page cache to the consumer, even before they are written to disk, hence the low latency.
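The key-to-partition routing above can be sketched in a few lines. This is a toy model, not Kafka's actual partitioner (Kafka hashes the serialized key with murmur2; the md5 here is just an illustrative deterministic stand-in):

```python
import hashlib

NUM_PARTITIONS = 6  # assumed partition count for illustration

def partition_for(key: str, num_partitions: int = NUM_PARTITIONS) -> int:
    # Deterministic hash of the key, modulo the partition count:
    # the same key always lands on the same partition.
    digest = hashlib.md5(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") % num_partitions

# All events for one stock symbol go to one partition, preserving order.
assert partition_for("AAPL") == partition_for("AAPL")
```

Because the mapping is deterministic, per-key ordering is guaranteed without any coordination between producers.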
Producer
The producer fetches metadata from the cluster; the metadata identifies each partition's leader. The producer then sends messages directly to the leader node. It also supports batching via the linger.ms and batch.size properties, improving performance by issuing fewer, larger I/O operations.
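As a sketch, that batching tuning might look like the following, in the config-dict form used by clients such as confluent-kafka for Python (the broker address and the values are illustrative, not recommendations):

```python
# Illustrative producer configuration; keys follow Kafka's producer
# property names, values are examples only.
producer_config = {
    "bootstrap.servers": "localhost:9092",  # assumed broker address
    "linger.ms": 10,       # wait up to 10 ms for more records per batch
    "batch.size": 32768,   # send a batch once it reaches 32 KB
}
```

Raising linger.ms trades a little latency for larger batches and fewer requests.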
Consumer
The consumer reads messages from topics, also connecting to the partition leader. It sends an offset and a fetch size, and can choose to commit offsets before processing messages (at-most-once) or after processing them (at-least-once). Consumers can also work as a distributed group, in which a single machine owns each partition.
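The effect of that commit-ordering choice can be shown with a toy model (this simulates the offset bookkeeping, it is not the Kafka consumer API). A "crash" strikes while the record at one offset is being processed; on restart, consumption resumes from the last committed offset:

```python
def run(records, commit_before, crash_during_offset):
    """Consume records across one crash and one restart."""
    committed, processed = 0, []
    for attempt in range(2):                 # first run, then restart
        crashed = False
        for offset in range(committed, len(records)):
            if commit_before:
                committed = offset + 1       # commit first (at-most-once)
            if attempt == 0 and offset == crash_during_offset:
                crashed = True               # simulated crash mid-processing
                break
            processed.append(records[offset])
            if not commit_before:
                committed = offset + 1       # commit after (at-least-once)
        if not crashed:
            break
    return processed

records = ["a", "b", "c"]
# Commit before processing: the crashed record "b" is skipped on restart.
assert run(records, commit_before=True, crash_during_offset=1) == ["a", "c"]
# Commit after processing: nothing is lost; "b" is retried on restart.
assert run(records, commit_before=False, crash_during_offset=1) == ["a", "b", "c"]
```

Committing after processing can also replay an already-processed record if the crash lands between processing and commit, which is why it is called at-least-once.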
Computing
KStream is a DSL for writing higher-level code that performs stateful and stateless operations on topics. A Kafka Streams application usually runs as a distributed process on a cluster; its group membership is configured with the application.id setting.
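The stateless/stateful distinction can be sketched like this (a toy model of the operations, not the Kafka Streams API): filter and map look at one record at a time, while a grouped count needs a per-key store:

```python
def stateless_pipeline(stream):
    # filter + mapValues analogue: record by record, no state needed
    return [(k, v.upper()) for k, v in stream if v is not None]

def stateful_count(stream):
    counts = {}                      # stands in for a state store
    for key, _ in stream:
        counts[key] = counts.get(key, 0) + 1
    return counts

stream = [("aapl", "buy"), ("msft", "sell"), ("aapl", "buy")]
assert stateless_pipeline(stream) == [("aapl", "BUY"), ("msft", "SELL"), ("aapl", "BUY")]
assert stateful_count(stream) == {"aapl": 2, "msft": 1}
```

Stateless operations scale trivially; stateful ones are what make the state stores below necessary.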
KTable: again a distributed process that runs on a cluster, presenting a table view of the data in a topic.
Stream-table duality: a stream can be seen as the changelog of a table, and a table as a snapshot of the latest value per key in a stream.
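That duality can be sketched in a few lines: replaying a changelog stream and keeping the latest value per key rebuilds the table (a conceptual model, not the KTable implementation):

```python
def materialize(changelog):
    """Fold a changelog stream into a table: later records win per key."""
    table = {}
    for key, value in changelog:
        table[key] = value           # upsert
    return table

changelog = [("aapl", 150), ("msft", 300), ("aapl", 155)]
assert materialize(changelog) == {"aapl": 155, "msft": 300}
```

Going the other way, emitting each upsert as a record turns the table back into a stream.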
State Stores:
How does Kafka Streams save state? State stores hold it, and can be used to share state between nodes; they can need a lot of memory. A store can be in memory or in an embedded database called RocksDB (a persistent store). There is no concept of expiration for data in state stores, so tombstone messages must be used to clean up data.
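The tombstone mechanism can be sketched as follows (a toy model of store cleanup, not the RocksDB-backed store itself): a record whose value is null deletes the key from the store, and log compaction eventually drops it from the topic too:

```python
def apply(store, key, value):
    """Apply one changelog record to a state store."""
    if value is None:
        store.pop(key, None)         # tombstone: remove the key
    else:
        store[key] = value           # upsert
    return store

store = {}
apply(store, "user-1", {"name": "Ada"})
apply(store, "user-1", None)         # tombstone cleans up the entry
assert "user-1" not in store
```

Without tombstones, every key ever written would stay in the store forever, since there is no built-in expiration.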
Stream applications are split into tasks, which are balanced across instances like a consumer group, keyed by application.id.