To Start Kafka Server

.\bin\windows\kafka-server-start.bat .\config\server.properties
.\bin\windows\kafka-server-stop.bat

To Start Zookeeper Server

.\bin\windows\zookeeper-server-start.bat .\config\zookeeper.properties
.\bin\windows\zookeeper-server-stop.bat

Note: ZooKeeper must be running before the Kafka broker is started; the broker registers itself in ZooKeeper on startup.

Create Topic in Kafka with 5 Partition and Replication factor of 1

kafka-topics.bat  --bootstrap-server localhost:9092 --topic firsttopic --create --partitions 5  --replication-factor 1

Note: The replication factor cannot exceed the number of brokers in the cluster, so on a single-broker localhost setup it must be 1.

List Topics

kafka-topics.bat --bootstrap-server localhost:9092 --list

Describe Topic

kafka-topics.bat  --bootstrap-server localhost:9092 --topic firsttopic --describe

Delete Topic

kafka-topics.bat  --bootstrap-server localhost:9092 --topic firsttopic --delete

Producer to push data into Topic in Kafka

kafka-console-producer.bat --bootstrap-server localhost:9092 --topic test

Producer sending data into Topic as Key:Value pair

kafka-console-producer.bat --bootstrap-server localhost:9092 --topic firsttopic --property parse.key=true --property key.separator=:

Note:

  1. Messages with the same key always end up in the same partition.
  2. The key.separator property must be passed on the command line so the producer can split each line into key and value.

If you produce to a topic that does not exist, the broker auto-creates it (the producer retries after the initial metadata error), provided auto.create.topics.enable=true; the topic is created with the broker's default partition and replication settings.
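
For example, with the keyed producer above, typing the lines below (the user ids are just illustrative) sends both user_1 events to the same partition:

user_1:logged in
user_2:logged in
user_1:logged out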

Consumer to pull data from Topic in Kafka

kafka-console-consumer.bat --topic test --bootstrap-server localhost:9092 --from-beginning

Print Partition, Key, Value in consumer

kafka-console-consumer.bat --topic thirdtopic --bootstrap-server localhost:9092  --formatter kafka.tools.DefaultMessageFormatter --property print.timestamp=true --property print.key=true --property print.value=true --property print.partition=true --from-beginning

Adding a Consumer to a Consumer Group

kafka-console-consumer.bat --bootstrap-server localhost:9092 --topic thirdtopic --group my-first-application
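
To check which consumers are in the group and which partitions each one has been assigned, describe the group with the consumer-groups tool (the --members and --verbose flags show the per-member assignment):

kafka-consumer-groups.bat --bootstrap-server localhost:9092 --describe --group my-first-application --members --verbose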

Reset Offsets for a Topic in All Partitions

kafka-consumer-groups.bat --bootstrap-server localhost:9092 --group my-first-application --topic thirdtopic --reset-offsets --to-earliest --execute

Note: Resetting offsets makes the group's consumers read again from the new offset position. The group must be inactive (all its consumers stopped) while offsets are reset; replace --execute with --dry-run to preview the result first.

Sample message payload: {"Message": "Hello World from Kafka"}

How Topics, Partitions and Brokers are related

Topics are logical categories or streams of data within Kafka. They act as message queues where producers publish data and consumers retrieve it.
Brokers are servers that store and manage topics and handle communication between producers and consumers.
Partitions are the basic unit of data storage and distribution within Kafka topics. They are the main unit of concurrency for a topic and are used to improve performance and scalability.

What is Broker Discovery?
A client that wants to send or receive messages from the Kafka cluster may connect to any broker in the cluster. Every broker in the cluster has metadata about all the other brokers and will help the client connect to them as well, and therefore any broker in the cluster is also called a bootstrap server.

  1. A client connects to a broker in the cluster
  2. The client sends a metadata request to the broker
  3. The broker responds with the cluster metadata, including a list of all brokers in the cluster
  4. The client can now connect to any broker in the cluster to produce or consume data
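
One way to see discovery in action is the broker API-versions tool: point it at a single bootstrap broker and it prints every broker in the cluster that it learned about from the metadata response.

kafka-broker-api-versions.bat --bootstrap-server localhost:9092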

What is Replication Factor?
The replication factor is the number of copies of a topic's partitions kept across different brokers. For a production system the replication factor should be at least 3; a factor of 3 is commonly used because it balances tolerance to broker loss against replication overhead.

Note: topic replication does not increase consumer parallelism; only the partition count does.
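
On a cluster with at least three brokers (not the single-broker localhost setup above), a replicated topic can be created like this; the topic name is just an example:

kafka-topics.bat --bootstrap-server localhost:9092 --topic replicatedtopic --create --partitions 5 --replication-factor 3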


How to choose the replication factor

It should be at least 2 and at most 4. The recommended number is 3, as it provides the right balance between performance and fault tolerance, and cloud providers usually offer 3 data centers / availability zones per region to deploy to. The advantage of a higher replication factor is better resilience: if the replication factor is N, up to N-1 brokers may fail without impacting availability (when acks=0 or acks=1).

The disadvantages of a higher replication factor are higher latency for producers, since the data must be replicated to all replica brokers before an ack is returned when acks=all, and more disk space required on your system.

If there is a performance issue due to a higher replication factor, you should get a better broker instead of lowering the replication factor.

Maximum replication factor = number of brokers in the cluster

What is min.insync.replica?
min.insync.replicas is the minimum number of in-sync copies of the data that must be available for the broker to keep accepting new incoming messages (it is enforced when producers use acks=all). It defaults to 1.
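
It can be set per topic with the kafka-configs tool; the example below raises it to 2 for the topic created earlier:

kafka-configs.bat --bootstrap-server localhost:9092 --entity-type topics --entity-name firsttopic --alter --add-config min.insync.replicas=2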

What is the role of ZooKeeper in Kafka?

  1. Electing a controller. The controller is one of the brokers and is responsible for maintaining the leader/follower relationship for all the partitions. When a node shuts down, it is the controller that tells other replicas to become partition leaders to replace the partition leaders on the node that is going away. ZooKeeper is used to elect a controller, make sure there is only one, and elect a new one if it crashes.
  2. Cluster membership – which brokers are alive and part of the cluster? This is also managed through ZooKeeper.
  3. Topic configuration – which topics exist, how many partitions each has, where the replicas are, who the preferred leader is, and what configuration overrides are set for each topic.
  4. (0.9.0) Quotas – how much data each client is allowed to read and write.
  5. (0.9.0) ACLs – who is allowed to read and write to which topic; and (for the old high-level consumer) which consumer groups exist, who their members are, and the latest offset each group has for each partition.
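
Broker registrations can be inspected directly with the ZooKeeper shell that ships with Kafka; /brokers/ids holds one entry per live broker:

zookeeper-shell.bat localhost:2181 ls /brokers/ids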

What is bootstrap.servers?
bootstrap.servers provides the initial hosts that act as the starting point for a Kafka client to discover the full set of alive servers in the cluster. bootstrap.servers is a configuration we place within clients, which is a comma-separated list of host and port pairs that are the addresses of the Kafka brokers in a “bootstrap” Kafka cluster that a Kafka client connects to initially to bootstrap itself.

Since these servers are just used for the initial connection to discover the full cluster membership (which may change dynamically), this list does not have to contain the full set of servers (you may want more than one, though, in case a server is down).

It is the URL of one of the Kafka brokers which you give to fetch the initial metadata about your Kafka cluster. The metadata consists of the topics, their partitions, the leader brokers for those partitions etc. Depending upon this metadata your producer or consumer produces or consumes the data.

You can have multiple bootstrap servers in your producer or consumer configuration, so that if one of the brokers is not accessible the client falls back to the others.
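
In a client properties file this is a single comma-separated setting; the second and third hosts below are hypothetical additional brokers:

bootstrap.servers=localhost:9092,localhost:9093,localhost:9094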

When keys are null, Kafka's default partitioner does not switch partitions for every message: it sticks to one partition until the current batch fills up (batch.size, 16 KB by default) and only then moves on to the next partition.

What is Consumer Group?
If more than one consumer in the same group reads a topic, the topic's partitions are divided among the consumers in the group, so each consumer reads only a subset of the partitions.
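
Running the same command in two terminal windows puts two consumers into one group, and the five partitions of firsttopic get split between them:

kafka-console-consumer.bat --bootstrap-server localhost:9092 --topic firsttopic --group my-first-application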

In Kafka, messages are always stored in a key-value format: the key is hashed to determine the partition, and the value is the actual data.

During writing (message creation), producers use serializers to convert messages to byte format; Kafka ships serializers for the different datatypes that need converting.
Consumers use deserializers on their end to convert the bytes back into the original data.

Kafka also allows custom serializers, which help convert application-specific data to a byte stream.
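
The console consumer lets you name the deserializers explicitly; the string deserializer below matches what the console producer writes:

kafka-console-consumer.bat --bootstrap-server localhost:9092 --topic firsttopic --from-beginning --key-deserializer org.apache.kafka.common.serialization.StringDeserializer --value-deserializer org.apache.kafka.common.serialization.StringDeserializer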

How Consumer reads data
Consumers keep track of what they have read using consumer offsets. A consumer offset in Kafka is an integer that tracks the position of the last message a consumer has processed in a partition.
In order to "checkpoint" how far it has read into a topic partition, the consumer regularly commits the offset of the latest processed message, also known as the consumer offset.

Offsets are important for a number of reasons, including:

  1. Data continuity: offsets allow consumers to resume processing from where they left off if the stream application fails or shuts down.
  2. Sequential processing: offsets enable Kafka to process data in a sequential and ordered manner.
  3. Replayability: offsets allow for replayable data processing.

When a consumer group is first initialized, consumers usually start reading from the earliest or latest offset in each partition. Consumers commit the offsets of messages they have processed successfully.
The position of the last available message in a partition is called the log-end offset. Consumers can store processed offsets in local variables or in-memory data structures, and then commit them in bulk.
Consumers can use a commit API to gain full control over offsets.
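
Committed offsets, the log-end offset, and the resulting lag can be checked per partition with the consumer-groups tool (output columns include CURRENT-OFFSET, LOG-END-OFFSET, and LAG):

kafka-consumer-groups.bat --bootstrap-server localhost:9092 --describe --group my-first-application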

Kafka
One common activity in an application is transferring data from a source system to a target system. Over time this communication becomes complex and messy. Kafka makes it simple to build real-time streaming data pipelines and real-time streaming applications. The Kafka cluster sits in the middle, between the source system and the target system: data from the source system is moved into the cluster by a producer, and data from the cluster is moved to the target system by a consumer.

        Source System -> {Producer} -> [Kafka Cluster] -> {Consumer} -> Target System

Kafka provides seamless integration across applications hosted on multiple platforms by acting as an intermediary.

  1. A sequence of messages is called a data stream. A topic is a particular stream of data, and topics are organized inside the cluster. Topics are like tables in a database, identified by topic name.
  2. Cluster contains topics -> topics contain data streams -> a data stream is made of a sequence of messages.
  3. To add (write) data to a topic we use Kafka producers, and to read data we use Kafka consumers.
  4. Topics
    1. What is Topic? Topics are the categories used to organize messages. Topics are like tables in a database, identified by topic name (the table name).
    2. Why is it needed? It provides a logical channel for producers to publish messages and consumers to receive them, e.g. processing payments, tracking assets, monitoring patients, tracking customer interactions.
    3. How it works? A topic is a log of events that producers append to and consumers read from. Logs are easy to understand, because they are simple data structures with well-known semantics.

  5. Partitions
    1. What are Partitions? Topics are split into multiple partitions. Messages sent to a topic end up in one of these partitions, and within a partition messages are ordered by an id (the Kafka partition offset). A partition in Kafka is the storage unit that allows a topic log to be separated into multiple logs and distributed over the Kafka cluster.
      Partitions are immutable: once written to a partition, data cannot be changed. Data is retained for one week by default (see the retention command after this list).
    2. Why Partition is needed? Partitions allow Kafka to scale horizontally by distributing data across multiple brokers. Multiple consumers can read from different partitions in parallel, and multiple producers can write to different partitions simultaneously. Each partition can have multiple replicas spread across different brokers.
    3. How Partition works? By breaking a single topic log into multiple logs, which are then spread across one or more brokers, Kafka can scale and handle large amounts of data efficiently.

  6. Broker

    1. What is Broker? A broker is a server that manages the flow of messages between producers and consumers in Apache Kafka. Kafka brokers store data in topics, which are divided into partitions. Each broker hosts a set of partitions and handles requests from clients to write and read events to and from those partitions.
    2. How Broker works? One broker acts as the Kafka controller (broker leader), which does administrative tasks: maintaining the state of the other brokers, health-checking brokers, and reassigning work.
    3. Why Broker is required?Producers connect to a broker to write events, while consumers connect to read events.
  7. Offset
    1. What is Offset? Offset is a unique identifier for a message in a Kafka partition. An offset is an integer that represents the position of a message in a partition’s log. The first message in a partition has an offset of 0, the second message has an offset of 1, and so on.
    2. Why is it needed? Offsets enable Kafka to provide sequential, ordered, and replayable data processing. This numerical value helps Kafka keep track of progress within a partition. It also allows Kafka to scale horizontally while staying fault-tolerant.
    3. How it works? When a producer publishes a message to a Kafka topic, it’s appended to the end of the partition’s log and assigned a new offset. Consumers maintain their current offset, which indicates the last processed message in each partition.
  8. Producer
    1. What is Producer? A producer writes data to a Kafka broker, from where it is picked up by consumers. A producer can send anything, but it’s typically serialized into a byte array. A message can also include a key, timestamp, compression type, and headers.
    2. How Producer works? A producer writes messages to a Kafka broker, which appends each message to a partition and assigns it an offset.
    3. Why Producer is needed? It allows applications to send streams of data to the Kafka cluster
    4. The producer uses a partitioner to decide which partition each message should be written to. The producer does not choose the broker directly; the message ends up on whichever broker hosts that partition.
    5. Each message a producer sends can carry a key. If the key is null, the data is distributed across partitions in a round-robin fashion. If the key is not null, every message with that key ends up in the same partition, so message ordering per key is possible.
  9. Consumer
    1. What is Consumer? Consumer reads data from Kafka broker.
    2. How Consumer works? A consumer issues fetch requests to brokers for partitions it wants to consume. It specifies a log offset, and receives a chunk of log that starts at that offset position. The consumer should know in advance the format of the message.
    3. Why Consumer is needed? It allows applications to receive streams of data from the Kafka broker
  10. Consumer Group
    1. What is Consumer Group? A consumer group is a collection of consumer applications that work together to process data from topics in parallel.
    2. How Consumer Group works? A consumer group divides the partitions of a topic among its consumers. Each consumer is assigned a subset of partitions, and only one consumer in the group processes a given partition.
    3. Why Consumer group is needed? Consumer groups allow multiple consumers to work together to process events from a topic in parallel. This is important for scalability, as it enables many events to be read simultaneously.
  11. Messages

    1. What is Message? Kafka messages are created by the producer using a serialization mechanism.
    2. How Message works? Kafka messages are stored as serialized bytes and have a key-value structure
    3. Why Message is needed? Basic Unit of data in Kafka
    4. The key determines which partition a message lands in and may be null, in which case partitions are chosen round-robin. The value is the actual message. Both key and value are stored in binary format.
  12. Partitioner

    1. What is Partitioner? Kafka’s partitioning feature is a key part of its ability to scale and handle large amounts of data. Partitioning is done by the partitioner.
    2. How Partitioner works? A Kafka partitioner uses hashing to determine which partition a message should be sent to. It employs a key-hashing technique that allows related messages to be grouped together and processed in the correct order. For example, if a Kafka producer uses a user ID as the key for messages about various users, all messages related to a specific user will be sent to the same partition.
  13. Zookeeper

    1. What is Zookeeper?ZooKeeper is a software tool that helps maintain naming and configuration data, and provides synchronization within distributed systems
    2. How Zookeeper works? Zookeeper keeps track of which brokers are part of the Kafka cluster. Zookeeper is used by Kafka brokers to determine which broker is the leader of a given partition and topic and perform leader elections. Zookeeper stores configurations for topics and permissions. Zookeeper sends notifications to Kafka in case of changes (e.g. new topic, broker dies, broker comes up, delete topics, etc.…)