Friday, October 6, 2023

Akka And Kotlin

Kafka - Topics, Tables and Streams - a 10-minute introduction

This is a condensed, ~400-word introduction to Kafka.

Confluent Kafka is distributed storage and computing software that can provide high throughput and low latency.

Storage

The basic unit of Kafka storage is a partitioned log called a topic, replicated across machines called brokers.

Topics are divided into partitions, and events/messages with the same key end up in the same partition, in order. For example, all events for a single stock will end up on the same partition.
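The key-to-partition mapping can be sketched as follows. This is a simplified, self-contained illustration: Kafka's default partitioner actually uses a murmur2 hash of the key bytes, but the principle - same key, same partition - is identical.

```java
public class KeyPartitioning {
    // Simplified stand-in for Kafka's default partitioner (which uses murmur2):
    // the same key always hashes to the same partition, preserving per-key order.
    static int partitionFor(String key, int numPartitions) {
        return Math.abs(key.hashCode() % numPartitions);
    }

    public static void main(String[] args) {
        int numPartitions = 3;
        // Two events for the same stock symbol land on the same partition,
        // so their relative order is preserved.
        int p1 = partitionFor("AAPL", numPartitions);
        int p2 = partitionFor("AAPL", numPartitions);
        System.out.println(p1 == p2); // true: same key, same partition
    }
}
```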

Partitions are replicated for durability. Messages can flow directly from producer to the broker's page cache to consumer [even before they are written to disk], hence the low latency.

Producer

The producer fetches metadata from the cluster. The metadata contains the partition leader information, and the producer sends messages directly to the leader node. It also supports batching, via the linger.ms and batch.size properties, to send fewer, larger IO operations and improve throughput.
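A minimal sketch of the batching-related producer settings. The values and broker address are illustrative assumptions; in a real application these properties would be passed to a KafkaProducer from the kafka-clients library.

```java
import java.util.Properties;

public class ProducerBatchingConfig {
    // Illustrative batching settings; values are assumptions, not recommendations.
    static Properties batchingProps() {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // assumed broker address
        props.put("linger.ms", "20");     // wait up to 20 ms to fill a batch
        props.put("batch.size", "65536"); // max batch size in bytes (64 KiB)
        return props;
    }

    public static void main(String[] args) {
        // These props would be handed to a KafkaProducer (requires kafka-clients).
        System.out.println(batchingProps().getProperty("linger.ms")); // 20
    }
}
```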

Consumer

The consumer reads messages from topics. It also connects to the partition leader, sending the offset and fetch size in its requests. It can choose to commit its offset before processing messages or after processing them. Consumers can also work as a distributed consumer group, where each partition is owned by a single consumer instance.
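The commit-before vs commit-after choice can be sketched with a hypothetical consumer loop (plain Java, no real Kafka client involved):

```java
import java.util.List;

public class OffsetCommitOrder {
    // Commit AFTER processing => at-least-once: a crash between processing and
    // commit means the message is re-read and processed again on restart.
    // Committing BEFORE processing would give at-most-once instead: a crash
    // after commit but before processing loses that message.
    static long processAll(List<String> messages) {
        long committedOffset = 0;
        for (String m : messages) {
            process(m);
            committedOffset++; // commit offset only after processing succeeded
        }
        return committedOffset;
    }

    static void process(String m) { /* application logic would go here */ }

    public static void main(String[] args) {
        System.out.println(processAll(List.of("m0", "m1", "m2"))); // 3
    }
}
```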

Computing

KStream is part of the Kafka Streams DSL for writing higher-level code that performs stateful and stateless operations on topics. This is usually a distributed process that runs in a cluster. Group membership is configured with the application.id setting for Kafka Streams apps, the ksql.service.id setting for ksqlDB servers, and the group.id setting for applications that use one of the lower-level Kafka consumer clients.

KTable: again a distributed process running on a cluster, presenting a table view of the data in the topic (the latest value per key).


Stream table duality

A table can be viewed as a snapshot of a stream of updates, and a stream as the changelog of a table; each can be reconstructed from the other.
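The duality can be sketched in a few lines: folding a stream of per-key updates (a changelog) yields the table view, so the table is recoverable from the stream. This is a plain-Java illustration, not Kafka Streams API code.

```java
import java.util.*;

public class StreamTableDuality {
    // Folding a changelog stream gives the table: latest value per key wins.
    static Map<String, Integer> toTable(List<Map.Entry<String, Integer>> stream) {
        Map<String, Integer> table = new HashMap<>();
        for (var e : stream) table.put(e.getKey(), e.getValue());
        return table;
    }

    public static void main(String[] args) {
        // A stream is a changelog: a sequence of (key, value) update events.
        List<Map.Entry<String, Integer>> stream = List.of(
            Map.entry("alice", 1),
            Map.entry("bob", 5),
            Map.entry("alice", 3)   // later update for the same key
        );
        System.out.println(toTable(stream).get("alice")); // 3: most recent update wins
    }
}
```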






State Stores:

State stores can be in memory or in an embedded database called RocksDB. There is no concept of expiration for data in state stores, so tombstone messages are used to clean up data.
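A tombstone is conventionally a record whose value is null for a given key. A minimal sketch of how applying a tombstone cleans a state store, with a plain map standing in for the store:

```java
import java.util.*;

public class TombstoneExample {
    // Apply one changelog record to the store; a null value is a tombstone
    // (delete marker) that removes the key entirely.
    static void apply(Map<String, String> store, String key, String value) {
        if (value == null) store.remove(key);
        else store.put(key, value);
    }

    public static void main(String[] args) {
        Map<String, String> store = new HashMap<>();
        apply(store, "user-1", "active");
        apply(store, "user-1", null);  // tombstone: cleans up the key
        System.out.println(store.containsKey("user-1")); // false
    }
}
```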

How does Kafka Streams save state, and how does it share state between nodes? State stores can need a lot of memory. The store can be in-memory or in RocksDB [a persistent store].
A Streams application is divided into tasks, which are balanced across instances like a consumer group, keyed by the application.id.


Schema Registry





Event example

Events have a key, a value, and a timestamp.

Example: Alice, came to UK, 2023-10-04 -- Alice came to the UK on 2023-10-04

                solarpanel, 14kW, 10:10 -- the solar panel generated 14kW at 10:10
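The event shape above can be modeled as a small record type. The Event class here is a hypothetical illustration, not a Kafka API type:

```java
public class EventExample {
    // Hypothetical event shape: every record carries a key, value, and timestamp.
    record Event(String key, String value, String timestamp) {}

    static String describe(Event e) {
        return e.key() + "=" + e.value() + " @ " + e.timestamp();
    }

    public static void main(String[] args) {
        Event e = new Event("solarpanel", "14kW", "10:10");
        System.out.println(describe(e)); // solarpanel=14kW @ 10:10
    }
}
```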




Wednesday, October 4, 2023

Kafka reliable delivery

Reliable delivery means that once a message is sent to Kafka and acknowledged, it will survive broker failures.

On Producer Side:

    We need to set acks=all and have at least 3 in-sync replicas
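A sketch of the reliability-oriented producer settings, assuming a local broker address. Note that min.insync.replicas is a broker/topic-level setting, not a producer property, so it appears only as a comment:

```java
import java.util.Properties;

public class ReliableProducerConfig {
    // Illustrative reliability settings; the broker address is an assumption.
    static Properties reliableProps() {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("acks", "all");                // wait for all in-sync replicas
        props.put("enable.idempotence", "true"); // avoid duplicates on retry
        // Broker/topic-level (not a producer prop):
        //   replication factor 3 with min.insync.replicas=2
        return props;
    }

    public static void main(String[] args) {
        System.out.println(reliableProps().getProperty("acks")); // all
    }
}
```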

If message A was written by the same producer before message B to the same partition, the offset of message A will be lower than that of message B.

Messages are considered committed when they have been written to the page cache of all in-sync replicas.

Committed messages will not be lost as long as at least one in-sync replica survives.




Assuming acks=all and min.insync.replicas=2:

1. Producer writes to leader

2.  The leader responds only when it has persisted the message and the required followers have fetched it from the leader. [If min.insync.replicas=2, only 1 more follower besides the leader must have the message, even though the topic replication factor is 3]

3.  Consumers read only up to the high watermark*

Assuming acks=1 and min.insync.replicas=2:

1. Producer writes to leader

2.  The leader responds immediately.

[With min.insync.replicas=2, one more follower must still fetch the message, and only then is the high watermark incremented]

3.  Consumers read only up to the high watermark*

On the Consumer Side:

      Kafka uses the term high watermark to mark messages that have been fully replicated


Refer: https://rongxinblog.wordpress.com/2016/07/29/kafka-high-watermark/

*The high watermark is calculated as the minimum LEO (log end offset) across all in-sync replicas of the partition, and it grows monotonically.
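The formula can be sketched directly: the high watermark is simply the minimum LEO across the ISR.

```java
import java.util.*;

public class HighWatermark {
    // High watermark = minimum log end offset (LEO) across the in-sync replicas.
    static long highWatermark(List<Long> isrLogEndOffsets) {
        return Collections.min(isrLogEndOffsets);
    }

    public static void main(String[] args) {
        // Leader has 10 messages; the two followers have replicated 8 and 9.
        List<Long> leos = List.of(10L, 8L, 9L);
        System.out.println(highWatermark(leos)); // 8: consumers see offsets below 8
    }
}
```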

Presto! You have a reliable system.