Friday, October 6, 2023

Kafka - Topics, Tables and Streams - a 10-minute introduction

This is a condensed, 400-word introduction to Kafka.

Confluent Kafka is distributed storage and computing software that can provide high throughput and low latency.

Storage

The basic unit of Kafka storage is a partitioned log called a Topic, replicated across machines called brokers.

Topics are divided into partitions, and events/messages with the same key end up on the same partition, in order. For example, all events for a single stock will land on the same partition.
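The key-to-partition mapping can be sketched as hashing the key modulo the partition count. (This is a simplified stand-in: Kafka's default partitioner actually uses a murmur2 hash, not the CRC32 used here.)

```python
import zlib

def pick_partition(key: bytes, num_partitions: int) -> int:
    # Hash the key and take it modulo the partition count.
    # Kafka's real default partitioner uses murmur2; CRC32 is
    # used here only to keep the sketch in the standard library.
    return zlib.crc32(key) % num_partitions

# Every event for the same stock symbol maps to the same partition,
# so per-key ordering is preserved within that partition.
assert pick_partition(b"AAPL", 6) == pick_partition(b"AAPL", 6)
```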

Partitions are replicated for durability. Messages can flow directly from the producer to the broker's page cache to the consumer, even before they are written to disk, hence the low latency.

Producer

The producer fetches metadata from the cluster, which contains the leader for each partition. It then sends messages directly to the leader node. It also supports batching via the linger.ms and batch.size properties, turning many small writes into fewer, larger I/O operations, which improves throughput.
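A minimal configuration sketch for the batching settings above, using librdkafka-style keys as accepted by the confluent-kafka Python client (the broker address and topic name are assumptions):

```python
# Batching trades a little latency (linger.ms) for fewer,
# larger I/O operations (batch.size).
producer_config = {
    "bootstrap.servers": "localhost:9092",  # assumed broker address
    "linger.ms": 20,       # wait up to 20 ms to fill a batch
    "batch.size": 65536,   # maximum batch size in bytes
}

# Usage (requires a running broker and confluent-kafka installed):
# from confluent_kafka import Producer
# producer = Producer(producer_config)
# producer.produce("stock-events", key=b"AAPL", value=b"price=189.5")
# producer.flush()
```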

Consumer

The consumer reads messages from topics, also connecting to the partition leader. It sends an offset and a fetch size, and can choose to commit the offset before processing messages or after processing them. Consumers can also work as a distributed group, where each partition is owned by a single consumer instance.
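The commit-before versus commit-after choice is the classic at-most-once versus at-least-once trade-off, sketched here with a plain list standing in for a partition and a callback standing in for the offset commit:

```python
def process(msg):
    print("processing", msg)

def consume_at_most_once(messages, commit):
    # Commit the offset BEFORE processing: if we crash mid-processing,
    # the message is never re-read (it may be lost).
    for offset, msg in enumerate(messages):
        commit(offset + 1)
        process(msg)

def consume_at_least_once(messages, commit):
    # Commit the offset AFTER processing: if we crash before the commit,
    # the message is re-read on restart (it may be processed twice).
    for offset, msg in enumerate(messages):
        process(msg)
        commit(offset + 1)

committed = []
consume_at_least_once(["a", "b"], committed.append)
assert committed == [1, 2]
```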

Computing

KStream is a DSL for writing higher-level code that performs stateful and stateless operations on topics. It usually runs as a distributed process on a cluster. This group membership is configured with the application.id setting for Kafka Streams apps, the ksql.service.id setting for ksqlDB servers in a ksqlDB cluster, and the group.id setting for applications that use one of the lower-level Kafka consumer clients.

KTable: again a distributed process that runs on a cluster, presenting a table view of the data in a topic (the latest value per key).


Stream table duality

A topic can be read as a stream of change events or as a table of the latest value per key: replaying the stream rebuilds the table, and capturing the table's changes yields the stream.
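Stream-table duality can be sketched with a list of change events and a dict standing in for the materialized table:

```python
# Replaying a stream of (key, value) change events materializes a
# table (the latest value per key); the table's change history is
# exactly that stream.
stream = [
    ("alice", "London"),
    ("bob", "Paris"),
    ("alice", "Manchester"),  # update for the same key
]

table = {}
for key, value in stream:
    table[key] = value  # latest value per key wins

assert table == {"alice": "Manchester", "bob": "Paris"}
```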


State Stores:

State stores can be held in memory or in an embedded database called RocksDB (a persistent store). There is no concept of expiration for data in state stores, so tombstone messages are used to clean up data.

This is how Kafka Streams saves state, and it allows state to be moved between nodes when work is rebalanced; state stores can require a lot of memory. Stream applications are divided into tasks, which are balanced across instances like a consumer group keyed by the application.id.
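The tombstone convention above can be sketched with a minimal key-value store, where a record whose value is None deletes the key (the class and method names are illustrative, not Kafka's API):

```python
class StateStore:
    """Minimal key-value state store with tombstone semantics."""

    def __init__(self):
        self._store = {}  # in-memory; Kafka Streams may also use RocksDB

    def put(self, key, value):
        if value is None:             # tombstone: remove the key,
            self._store.pop(key, None)  # since there is no expiration
        else:
            self._store[key] = value

    def get(self, key):
        return self._store.get(key)

store = StateStore()
store.put("user-1", {"visits": 3})
store.put("user-1", None)  # tombstone cleans up the entry
assert store.get("user-1") is None
```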


Schema Registry

Schema Registry stores and versions the schemas (for example Avro, JSON Schema, or Protobuf) used for message keys and values, and enforces compatibility rules so producers and consumers can evolve their schemas independently.





Event example

Events have a key, a value, and a timestamp.

Example: Alice, came to UK, 2023-10-04

                solarpanel, 14kW, 10:10 -- a solar panel generated 14kW at 10:10



