Eager Rebalance
Assume we have a topic with 10 partitions (0-9) and one consumer (let's name it consumer1) consuming it. When a second consumer (consumer2) appears, a rebalance is triggered for both of them (consumer1 gets an event, consumer2 does the initial rebalance). Now consumer1 closes all its existing connections (even those that will be reopened soon) and releases the partition ownership in Zookeeper for all 10 partitions.

Then it runs the partition assignment algorithm, decides which partitions it should claim, and claims partition ownership in Zookeeper again. If the claim is successful, consumer1 starts fetching its new partitions.

Meanwhile consumer2 runs the partition assignment algorithm as well and tries to claim its partitions in Zookeeper. The claim succeeds only once consumer1 has released ownership of those partitions. When the claim succeeds, consumer2 starts fetching; if it fails to claim its partitions within the allowed number of retries, you get a "rebalance failed after n retries" exception.

As you noticed, instead of just closing connections and releasing ownership for the partitions it no longer owns, consumer1 unnecessarily closes ALL of its connections and restarts with a smaller number of partitions. The same story applies when partitions are added (e.g. when we consume by a wildcard filter and a new topic appears): ALL connections are closed and then opened again instead of just opening the new ones.

What is the Disadvantage of the Above?
All consumers:

1. Stop consuming in order to give up their partition ownership
2. Re-join the group via the JoinGroup request
3. Receive a brand new partition assignment via the SyncGroup request, only once the rebalance finishes

There is a short window of unavailability for the entire consumer group between steps 1) and 3) – a “stop the world” event.
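
This is visible from the client side through the rebalance callbacks. Below is a minimal Java consumer sketch (assuming a local broker on localhost:9092 and the group/topic names used elsewhere in these notes); onPartitionsRevoked fires when the consumer gives up ownership in step 1, and onPartitionsAssigned fires only once the SyncGroup of step 3 completes.

    import java.time.Duration;
    import java.util.Collection;
    import java.util.List;
    import java.util.Properties;
    import org.apache.kafka.clients.consumer.ConsumerRebalanceListener;
    import org.apache.kafka.clients.consumer.KafkaConsumer;
    import org.apache.kafka.common.TopicPartition;

    public class RebalanceAwareConsumer {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.put("bootstrap.servers", "localhost:9092");   // assumed local broker
            props.put("group.id", "my-first-application");
            props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
            props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");

            KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props);
            consumer.subscribe(List.of("firsttopic"), new ConsumerRebalanceListener() {
                @Override
                public void onPartitionsRevoked(Collection<TopicPartition> partitions) {
                    // Step 1: this consumer gives up ownership of these partitions
                    System.out.println("Revoked: " + partitions);
                }
                @Override
                public void onPartitionsAssigned(Collection<TopicPartition> partitions) {
                    // Step 3: the new assignment arrives only once the rebalance finishes
                    System.out.println("Assigned: " + partitions);
                }
            });

            while (true) {
                consumer.poll(Duration.ofMillis(500)).forEach(
                    r -> System.out.println(r.partition() + ":" + r.offset() + " " + r.value()));
            }
        }
    }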

Three things determine how long this window of unavailability lasts during a rebalance:

  1. happy path: only as long as it takes the consumers to re-join
  2. not-so-happy path: up to session_timeout_ms
  3. worst case: until your consumers stabilize

If the rebalance does not complete successfully, a new rebalance is started again.
If a consumer doesn't join the group at all within session_timeout_ms, the group will complete the rebalance without it. When it does catch up and join, a new rebalance will start.
If a consumer starts the rebalance dance but doesn't complete it within session_timeout_ms, the group will abort the rebalance, kick it out of the group, and begin a new rebalance.

And when it restarts back up and joins – a new rebalance will start again.
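
These timeouts are ordinary consumer configuration properties. The lines below could be added to the Properties object in the consumer sketch above; the values are illustrative examples, not recommendations.

    // Illustrative timeout settings (example values only)
    props.put("session.timeout.ms", "45000");      // how long the coordinator waits for heartbeats before declaring the consumer dead
    props.put("heartbeat.interval.ms", "3000");    // how often the consumer sends heartbeats to the coordinator
    props.put("max.poll.interval.ms", "300000");   // maximum gap between poll() calls before the consumer is kicked out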

Besides the trickiness of the timeout during a rebalance, the basic cases that can start a new rebalance are:
• if any consumer fails.
• if any consumer restarts.
• if a new consumer joins the group.

So we have both:
1. plenty of common cases that trigger rebalances
2. tricky cases that can disrupt ongoing rebalances

A consumer group rebalance is nothing more than a partition reassignment between the consumers. If at any given time only one consumer is joining or leaving the group, it does not make much sense to pause all the other consumers when they are bound to get the exact same partitions after the rebalance finishes.
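
This is the motivation behind incremental (cooperative) rebalancing in newer clients, where only the partitions that actually move are revoked. A minimal sketch, assuming a Kafka client version that ships CooperativeStickyAssignor (2.4 or later), added to the same consumer Properties:

    // Opt in to incremental cooperative rebalancing instead of the eager protocol
    props.put("partition.assignment.strategy",
              "org.apache.kafka.clients.consumer.CooperativeStickyAssignor");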

To Start / Stop Kafka Server

.\bin\windows\kafka-server-start.bat .\config\server.properties
.\bin\windows\kafka-server-stop.bat

To Start / Stop Zookeeper Server

.\bin\windows\zookeeper-server-start.bat .\config\zookeeper.properties
.\bin\windows\zookeeper-server-stop.bat

Create a Topic in Kafka with 5 Partitions and a Replication Factor of 1

kafka-topics.bat  --bootstrap-server localhost:9092 --topic firsttopic --create --partitions 5  --replication-factor 1

Note: The replication factor cannot exceed the number of brokers, so with a single local broker it cannot be more than 1.

List Topics

kafka-topics.bat  --bootstrap-server localhost:9092 --list

Describe Topic

kafka-topics.bat  --bootstrap-server localhost:9092 --topic firsttopic --describe

Delete Topic

kafka-topics.bat  --bootstrap-server localhost:9092 --topic firsttopic --delete

Producer to push data into Topic in Kafka

kafka-console-producer.bat --broker-list localhost:9092 --topic test

Producer sending data into Topic as Key:Value pair

kafka-console-producer.bat --broker-list localhost:9092 --topic firsttopic  --property parse.key=true --property key.separator=:

Note:

  1. Messages with the same key always end up in the same partition (see the Java sketch below)
  2. The key separator must be passed on the command line so the producer can distinguish the key from the value

If you push data to a topic that does not exist, the broker auto-creates it (as long as auto.create.topics.enable is true); the producer may log a few warnings before the first send succeeds.
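
The same key-to-partition behaviour from application code; a minimal Java producer sketch, assuming the local broker and the firsttopic topic created above:

    import java.util.Properties;
    import org.apache.kafka.clients.producer.KafkaProducer;
    import org.apache.kafka.clients.producer.ProducerRecord;

    public class KeyedProducer {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.put("bootstrap.servers", "localhost:9092");
            props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
            props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");

            try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
                // Records that share a key always land in the same partition
                producer.send(new ProducerRecord<>("firsttopic", "user-1", "first message for user-1"));
                producer.send(new ProducerRecord<>("firsttopic", "user-1", "second message for user-1"));
                producer.send(new ProducerRecord<>("firsttopic", "user-2", "a message for user-2"));
            }
        }
    }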

Consumer to pull data from Topic in Kafka

kafka-console-consumer.bat --topic test --bootstrap-server localhost:9092 --from-beginning

Print Partition, Key, Value in consumer

kafka-console-consumer.bat --topic thirdtopic --bootstrap-server localhost:9092  --formatter kafka.tools.DefaultMessageFormatter --property print.timestamp=true --property print.key=true --property print.value=true --property print.partition=true --from-beginning

Adding consumer to consumer Group

kafka-console-consumer.bat --bootstrap-server localhost:9092 --topic third_topic --group my-first-application

Reset Offset in Topic in all partitions

kafka-consumer-groups.bat --bootstrap-server localhost:9092 --group my-first-application --topic thirdtopic --reset-offsets --to-earliest --execute

Note: Resetting offsets makes the consumer group read from the new offset position the next time it starts (the group must not be actively consuming while the reset runs).

{"Message": "Hello World from Kafka"}

How Topics, Partitions and Brokers are related

Topics are logical categories or streams of data within Kafka. They act as message queues where producers publish data and consumers retrieve it.
Brokers are servers that store and manage topics, and handle communication between producers and consumers.
Partitions are the basic unit of data storage and distribution within Kafka topics. They are the main method of concurrency for topics, and are used to improve performance and scalability.

What is Broker Discovery?
A client that wants to send or receive messages from the Kafka cluster may connect to any broker in the cluster. Every broker in the cluster has metadata about all the other brokers and will help the client connect to them as well, and therefore any broker in the cluster is also called a bootstrap server.

  1. A client connects to a broker in the cluster
  2. The client sends a metadata request to the broker
  3. The broker responds with the cluster metadata, including a list of all brokers in the cluster
  4. The client can now connect to any broker in the cluster to produce or consume data
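
A small Java sketch of this discovery step using the AdminClient (assuming only the local broker address is known up front); the metadata response returns every broker in the cluster:

    import java.util.Properties;
    import org.apache.kafka.clients.admin.AdminClient;
    import org.apache.kafka.clients.admin.DescribeClusterResult;

    public class BrokerDiscovery {
        public static void main(String[] args) throws Exception {
            Properties props = new Properties();
            props.put("bootstrap.servers", "localhost:9092");   // single bootstrap broker
            try (AdminClient admin = AdminClient.create(props)) {
                DescribeClusterResult cluster = admin.describeCluster();
                // The metadata lists all brokers, not just the bootstrap one
                cluster.nodes().get().forEach(node ->
                    System.out.println("Broker " + node.id() + " at " + node.host() + ":" + node.port()));
            }
        }
    }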

What is Replication Factor?
The replication factor is the number of copies of a topic's partitions across different brokers. When a topic is created (for example by Kafka Connect), the replication factor should be at least 3 for a production system. A replication factor of 3 is commonly used because it balances tolerance to broker loss against replication overhead.

Note that topic replication does not increase consumer parallelism.


How to choose the replication factor

It should be at least 2 and at most 4. The recommended number is 3, as it provides the right balance between performance and fault tolerance, and cloud providers usually offer 3 data centers / availability zones per region to deploy to. The advantage of a higher replication factor is better resilience of your system: if the replication factor is N, up to N-1 brokers may fail without impacting availability when acks=0 or acks=1.

The disadvantages of a higher replication factor are higher latency for producers (with acks=all the data needs to be replicated to all the replica brokers before an ack is returned) and more disk space required on your system.

If there is a performance issue due to a higher replication factor, you should get a better broker instead of lowering the replication factor

Maximum Replication Factor = No of Brokers in Cluster

What is min.insync.replicas?
min.insync.replicas is the minimum number of copies of the data that you are willing to have online at any time to continue running and accepting new incoming messages. It defaults to 1.
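
How replication factor and min.insync.replicas fit together in code; a hedged Java sketch, assuming a hypothetical three-broker cluster and a hypothetical topic named "orders":

    import java.util.List;
    import java.util.Map;
    import java.util.Properties;
    import org.apache.kafka.clients.admin.AdminClient;
    import org.apache.kafka.clients.admin.NewTopic;

    public class DurableTopic {
        public static void main(String[] args) throws Exception {
            Properties props = new Properties();
            // Hypothetical broker addresses for a three-broker cluster
            props.put("bootstrap.servers", "broker1:9092,broker2:9092,broker3:9092");
            try (AdminClient admin = AdminClient.create(props)) {
                // Replication factor 3; require at least 2 in-sync replicas to accept a write
                NewTopic topic = new NewTopic("orders", 5, (short) 3)
                        .configs(Map.of("min.insync.replicas", "2"));
                admin.createTopics(List.of(topic)).all().get();
            }

            // On the producer side, acks=all only succeeds while min.insync.replicas copies are online
            Properties producerProps = new Properties();
            producerProps.put("acks", "all");
        }
    }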

What is role of Zookeeper in kafka?

  1. Electing a controller. The controller is one of the brokers and is responsible for maintaining the leader/follower relationship for all the partitions. When a node shuts down, it is the controller that tells other replicas to become partition leaders to replace the partition leaders on the node that is going away. Zookeeper is used to elect a controller, make sure there is only one, and elect a new one if it crashes.
  2. Cluster membership – which brokers are alive and part of the cluster? This is also managed through ZooKeeper.
  3. Topic configuration – which topics exist, how many partitions each has, where the replicas are, who the preferred leader is, and what configuration overrides are set for each topic.
  4. (0.9.0) – Quotas – how much data is each client allowed to read and write.
  5. (0.9.0) – ACLs – who is allowed to read and write to which topic.
  6. (old high-level consumer) – which consumer groups exist, who their members are, and what the latest offset each group got from each partition is.

What is bootstrap.servers?
bootstrap.servers provides the initial hosts that act as the starting point for a Kafka client to discover the full set of alive servers in the cluster. bootstrap.servers is a configuration we place within clients, which is a comma-separated list of host and port pairs that are the addresses of the Kafka brokers in a “bootstrap” Kafka cluster that a Kafka client connects to initially to bootstrap itself.

Since these servers are just used for the initial connection to discover the full cluster membership (which may change dynamically), this list does not have to contain the full set of servers (you may want more than one, though, in case a server is down).

It is the URL of one of the Kafka brokers which you give to fetch the initial metadata about your Kafka cluster. The metadata consists of the topics, their partitions, the leader brokers for those partitions etc. Depending upon this metadata your producer or consumer produces or consumes the data.

You can have multiple bootstrap servers in your producer or consumer configuration, so that if one of the brokers is not accessible, the client falls back to another.
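
In client configuration this is just a comma-separated list. A small fragment (the host names are placeholders); any one reachable broker from the list is enough to bootstrap, and the full broker set is then learned from the metadata:

    Properties props = new Properties();
    // Placeholder hosts; one reachable entry is enough to discover the whole cluster
    props.put("bootstrap.servers", "broker1:9092,broker2:9092,broker3:9092");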

With null keys, Kafka's default (sticky) partitioner keeps writing to a single partition until the current batch fills up (batch.size, 16 KB by default) or linger.ms expires, and only then switches to another partition.

What is Consumer Group?
If more than one consumer comes together to read a topic, the topic, which is split across various partitions, is divided among the consumers in the group: each consumer reads from its own subset of partitions.

In Kafka, messages are always stored in key-value format: the key (after hashing) is used to determine the partition, and the value is the actual data.

During writing (message creation), producers use serializers to convert messages into byte format. Kafka ships different serializers depending on the data type that needs to be converted to bytes.
Consumers use deserializers on their end to convert the bytes back into the original data.

Kafka also allows custom serializers, which help convert application data types to a byte stream.
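
A minimal custom serializer sketch implementing the client Serializer interface (the Message payload class here is hypothetical and used only for illustration); it would be registered through the producer's value.serializer property:

    import java.nio.charset.StandardCharsets;
    import org.apache.kafka.common.serialization.Serializer;

    // Hypothetical payload type, used only for this example
    class Message {
        final String text;
        Message(String text) { this.text = text; }
    }

    // Converts a Message into bytes before it is written to the topic
    public class MessageSerializer implements Serializer<Message> {
        @Override
        public byte[] serialize(String topic, Message data) {
            if (data == null) return null;
            return data.text.getBytes(StandardCharsets.UTF_8);
        }
    }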

How a Consumer reads data
A consumer keeps track of the data it has read using consumer offsets. A consumer offset in Kafka is a unique integer that tracks the position of the last message a consumer has processed in a partition.
In order to "checkpoint" how far it has read into a topic partition, the consumer regularly commits the latest processed message, also known as the consumer offset.

Offsets are important for a number of reasons, including:

  1. Data continuity: offsets allow consumers to resume processing from where they left off if the stream application fails or shuts down.
  2. Sequential processing: offsets enable Kafka to process data in a sequential and ordered manner.
  3. Replayability: offsets allow for replayable data processing.

When a consumer group is first initialized, consumers usually start reading from the earliest or latest offset in each partition. Consumers commit the offsets of messages they have processed successfully.
The position of the last available message in a partition is called the log-end offset. Consumers can store processed offsets in local variables or in-memory data structures, and then commit them in bulk.
Consumers can use a commit API to gain full control over offsets.
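
A minimal Java sketch of manual offset commits, assuming the same local setup as the earlier examples and auto-commit switched off:

    import java.time.Duration;
    import java.util.List;
    import java.util.Properties;
    import org.apache.kafka.clients.consumer.ConsumerRecords;
    import org.apache.kafka.clients.consumer.KafkaConsumer;

    public class CommittingConsumer {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.put("bootstrap.servers", "localhost:9092");
            props.put("group.id", "my-first-application");
            props.put("enable.auto.commit", "false");    // commit offsets explicitly
            props.put("auto.offset.reset", "earliest");  // start from the earliest offset if none is committed
            props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
            props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");

            try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
                consumer.subscribe(List.of("firsttopic"));
                while (true) {
                    ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
                    records.forEach(r -> System.out.println(r.offset() + ": " + r.value()));
                    consumer.commitSync();   // checkpoint the latest processed offsets for this group
                }
            }
        }
    }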


What is Consumer Rebalance?

A process by which partitions are reassigned among the consumers in a group so that each consumer gets an (approximately) equal number of partitions to process.

Moving partition ownership from one consumer to another is called a rebalance.

A Rebalance happens when:

  1. a consumer JOINS the group
  2. a consumer SHUTS DOWN cleanly
  3. a consumer is considered DEAD by the group coordinator. This may happen after a crash or when the consumer is busy with long-running processing, which means that no heartbeats have been sent in the meantime by the consumer to the group coordinator within the configured session interval
  4. new partitions are added

With a group coordinator (one of the brokers in the cluster) and a group leader (the first consumer that joins the group) designated for a consumer group, a rebalance can be described more or less as follows:

  • the leader receives a list of all consumers in the group from the group coordinator (this will include all consumers that sent a heartbeat recently and which are therefore considered alive) and is responsible for assigning a subset of partitions to each consumer.
  • After deciding on the partition assignment (Kafka has a couple built-in partition assignment policies), the group leader sends the list of assignments to the group coordinator, which sends this information to all the consumers.

A consumer rebalance is initiated when a consumer requests to join or leave the group. In more detail:

  1. The group leader receives a list of all active consumers from the group coordinator.
  2. The group leader decides which partition(s) are assigned to each consumer by using the partition assignor.
  3. Once the group leader finalizes the partition assignment, it sends the assignment list to the group coordinator, which sends this information back to all consumers.
  4. The coordinator sends each consumer only its own assigned partitions; only the group leader is aware of all consumers and their assigned partitions.
  5. After the rebalance is complete, consumers start sending heartbeats to the group coordinator to signal that they are alive.
  6. Consumers send an OffsetFetch request to the group coordinator to get the last committed offsets for their assigned partitions.
  7. Consumers start consuming messages from their newly assigned partitions.

One of the main concepts in a rebalance is state management.


State Management

When the group starts rebalancing, the group coordinator first switches its state to Rebalance, so that all participating consumers are notified to rejoin the group, and then waits for all of them to do so. Once the rebalance completes, the group coordinator creates a new generation ID, notifies all consumers, and the group proceeds to the sync stage, where consumers send a SyncGroup request and wait until the group leader finishes generating the new partition assignment. Once consumers receive their newly assigned partitions, they move to the stable stage.

Kafka
One of the common activities in an application is to transfer data from a source system to a target system. Over a period of time this communication becomes complex and messy. Kafka makes it simple to build real-time streaming data pipelines and real-time streaming applications. The Kafka cluster sits in the middle, between the source system and the target system: data from the source system is moved to the cluster by a producer, and data from the cluster is moved to the target system by a consumer.

        Source System -> {Producer} -> [Kafka Cluster] -> {Consumer} -> Target System

Kafka provides seamless integration across applications hosted on multiple platforms by acting as an intermediary.

  1. A sequence of messages is called a data stream. A topic is a particular stream of data. Topics are organized inside the cluster. Topics are like tables in a database and are identified by the topic name.
  2. Cluster contains Topics -> Topics contain Data Streams -> a Data Stream is made of a sequence of Messages
  3. To add (write) data to a topic we use a Kafka producer, and to read data we use Kafka consumers.
  4. Topics
    1. What is Topic? Topics are the categories used to organize messages. Topics are like tables in a database and are identified by the topic name (like a table name).
    2. Why it is Needed? It is the logical channel for producers to publish messages and consumers to receive them, e.g. processing payments, tracking assets, monitoring patients, tracking customer interactions.
    3. How it works? A topic is a log of events. Logs are easy to understand, because they are simple data structures with well-known semantics.

  5. Partitions
    1. What are Partitions? Topics are split into multiple partitions. Messages sent to a topic end up in these partitions, and within a partition the messages are ordered by id (the Kafka partition offset). A partition in Kafka is the storage unit that allows a topic log to be separated into multiple logs and distributed over the Kafka cluster.
      Partitions are immutable: once data is written to a partition it cannot be changed. Data is kept for one week, which is the default retention configuration.
    2. Why Partition is needed? Partitions allow Kafka to scale horizontally by distributing data across multiple brokers.Multiple consumers can read from different partitions in parallel, and multiple producers can write to different partitions simultaneously. Each partition can have multiple replicas spread across different brokers.
    3. How Partition works? By breaking a single topic log into multiple logs, which are then spread across one or more brokers. This allows Kafka to scale and handle large amounts of data efficiently

  6. Broker

    1. What is Broker? A broker is a server that manages the flow of messages between producers and consumers in Apache Kafka. Kafka brokers store data in topics, which are divided into partitions. Each broker hosts a set of partitions. Brokers handle requests from clients to write and read events to and from partitions.
    2. How Broker works? One broker acts as the Kafka controller (Kafka broker leader), which performs administrative tasks: maintaining the state of the other brokers, health-checking brokers, and reassigning work.
    3. Why Broker is required? Producers connect to a broker to write events, while consumers connect to read events.
  7. Offset
    1. What is Offset? Offset is a unique identifier for a message in a Kafka partition. An offset is an integer that represents the position of a message in a partition’s log. The first message in a partition has an offset of 0, the second message has an offset of 1, and so on.
    2. Why it needed? Offsets enable Kafka to provide sequential, ordered, and replayable data processing. This numerical value helps Kafka keep track of progress within a partition. It also allows Kafka to scale horizontally while staying fault-tolerant.
    3. How it works? When a producer publishes a message to a Kafka topic, it’s appended to the end of the partition’s log and assigned a new offset. Consumers maintain their current offset, which indicates the last processed message in each partition.
  8. Producer
    1. What is Producer? A producer writes data to a Kafka broker, which is then picked up by a consumer. A producer can send anything, but it is typically serialized into a byte array. A message can also include a key, timestamp, compression type, and headers.
    2. How Producer works? A producer writes messages to a Kafka broker, where each message is appended to a partition and assigned an offset.
    3. Why Producer is needed? It allows applications to send streams of data to the Kafka cluster.
    4. The producer uses a partitioner to decide which partition the data should be written to. The producer does not choose the broker directly; the message ends up on whichever broker hosts that partition.
    5. The producer can attach a message key to each message it sends. If the key is null, the data is written using a round-robin mechanism. If the key is not null, messages with the same key end up in the same partition, so per-key message ordering is possible.
  9. Consumer
    1. What is Consumer? Consumer reads data from Kafka broker.
    2. How Consumer works? A consumer issues fetch requests to brokers for partitions it wants to consume. It specifies a log offset, and receives a chunk of log that starts at that offset position. The consumer should know in advance the format of the message.
    3. Why Consumer is needed? It allows applications to receive streams of data from the Kafka broker
  10. Consumer Group
    1. What is Consumer Group? A collection of consumer applications that work together to process data from topics in parallel.
    2. How Consumer Group works? A consumer group divides the partitions of a topic among its consumers. Each consumer is assigned a subset of partitions, and only one consumer in the group can process a given partition.
    3. Why Consumer group is needed? It allows multiple consumers to work together to process events from a topic in parallel. This is important for scalability, as it enables many events to be consumed simultaneously.
  11. Messages

    1. What is Message? Kafka messages are created by the producer using a serialization mechanism.
    2. How Message works? Kafka messages are stored as serialized bytes and have a key-value structure.
    3. Why Message is needed? It is the basic unit of data in Kafka.
    4. The key is used to determine the partition and may be null. The value is the actual message payload. Both key and value are stored in binary format.
  12. Partitioner

    1. What is Partitioner? Kafka's partitioning feature is a key part of its ability to scale and handle large amounts of data. Partitioning is done by the partitioner.
    2. How Partitioner works? A Kafka partitioner uses hashing to determine which partition a message should be sent to. Key hashing allows related messages to be grouped together and processed in the correct order. For example, if a Kafka producer uses a user ID as the key for messages about various users, all messages related to a specific user will be sent to the same partition (see the sketch after this list).
  13. Zookeeper

    1. What is Zookeeper? ZooKeeper is a software tool that helps maintain naming and configuration data and provides synchronization within distributed systems.
    2. How Zookeeper works? Zookeeper keeps track of which brokers are part of the Kafka cluster. It is used by Kafka brokers to determine which broker is the leader of a given partition and topic and to perform leader elections. Zookeeper stores configurations for topics and permissions. Zookeeper sends notifications to Kafka in case of changes (e.g. new topic, broker dies, broker comes up, topic deleted, etc.).
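
A hedged sketch of the key-hashing idea referenced above. The Java client's default partitioner has historically used a murmur2 hash of the key modulo the partition count; the helper below mirrors that formula for illustration only (Utils is an internal helper class of the Kafka client, so treat this as a demonstration, not production code).

    import java.nio.charset.StandardCharsets;
    import org.apache.kafka.common.utils.Utils;

    public class PartitionForKey {
        // Returns the partition a keyed record would map to, given the partition count
        static int partitionFor(String key, int numPartitions) {
            byte[] keyBytes = key.getBytes(StandardCharsets.UTF_8);
            return Utils.toPositive(Utils.murmur2(keyBytes)) % numPartitions;
        }

        public static void main(String[] args) {
            // The same key always maps to the same partition, so per-key ordering is preserved
            System.out.println(partitionFor("user-1", 5));
            System.out.println(partitionFor("user-1", 5)); // identical to the line above
            System.out.println(partitionFor("user-2", 5)); // may differ
        }
    }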