
Kafka: A Solution to Real-Time Event Processing and Data Integration


Problem statement: real-time event processing and data integration. For example, Uber operates in a dynamic environment where a massive volume of real-time data is generated from sources such as user interactions, ride requests, driver updates, and operational metrics. The challenge lies in efficiently processing, managing, and analyzing this continuous stream of data to ensure timely decision-making, seamless user experiences, and operational excellence.

Solution: Kafka

What is Kafka?

Kafka is an open-source distributed event streaming platform originally developed by LinkedIn and later donated to the Apache Software Foundation. It is designed for high-throughput, fault-tolerant, real-time data streaming.

Architecture of Kafka

1. Topics:

Kafka maintains feeds of messages in categories called topics. Each message in Kafka is a key-value pair associated with a topic. Topics are partitioned, allowing them to be distributed across multiple Kafka brokers for scalability and parallel processing. Producers publish messages to topics, and consumers subscribe to topics to consume messages.

2. Partitions:

Topics are further divided into partitions, which are ordered, immutable sequences of messages. Each partition is hosted on a single Kafka broker and can be replicated across multiple brokers for fault tolerance. Partitioning allows for parallelism and scalability, enabling Kafka to handle high-throughput workloads.

3. Producers:

Producers are client applications that publish messages to Kafka topics. Producers decide which topic and partition to publish messages to, optionally specifying a key for partitioning. Producers can publish messages asynchronously for high throughput or synchronously for guaranteed message delivery; the short sketch after this section shows keyed publishing with delivery acknowledgements.

4. Consumers:

Consumers are client applications that subscribe to Kafka topics to consume messages. Consumers read messages from partitions in a topic, maintaining an offset to track their position in each partition. A consumer can run standalone, reading from all partitions of a topic, or as part of a consumer group, in which Kafka distributes the topic's partitions among the group's members for parallel processing.

5. ZooKeeper:

Kafka relies on Apache ZooKeeper for distributed coordination and metadata management. ZooKeeper maintains information about Kafka brokers, topics, partitions, and consumer group membership, and Kafka brokers register themselves with ZooKeeper. (Early versions of Kafka also stored consumer offsets in ZooKeeper; modern Kafka keeps them in an internal topic, __consumer_offsets, and recent releases can run without ZooKeeper at all using KRaft.)
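
To make these producer-side concepts concrete, here is a minimal kafkajs sketch of keyed publishing. The broker address localhost:9092 is an assumption for illustration (substitute your own). Omitting the partition and supplying a key lets Kafka's default partitioner (a murmur2 hash of the key) choose the partition, and acks: -1 waits for all in-sync replicas, giving the synchronous-style delivery guarantee mentioned above.


const { Kafka } = require("kafkajs");

async function demo() {
  // Assumed broker address, for illustration only.
  const kafka = new Kafka({ clientId: "demo-app", brokers: ["localhost:9092"] });
  const producer = kafka.producer();
  await producer.connect();

  // No explicit partition: the default partitioner hashes the key,
  // so every message for "rider-42" lands on the same partition,
  // preserving per-rider ordering.
  await producer.send({
    topic: "rider-updates",
    acks: -1, // wait for all in-sync replicas to acknowledge
    messages: [
      { key: "rider-42", value: JSON.stringify({ location: "north" }) },
    ],
  });

  await producer.disconnect();
}

demo().catch(console.error);
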

Why Kafka and not databases?

Throughput (operations performed per second) is the key difference: a traditional database that must index and persist each write sustains far lower write throughput than Kafka, which appends messages sequentially to a log and leans heavily on the operating system's page cache. Traditional databases have their place in storing and querying structured data with strong consistency guarantees, but they are not well-suited to the high-throughput, low-latency requirements of real-time data streams. Kafka complements databases by serving as a high-performance, scalable, and fault-tolerant event streaming platform that efficiently handles the ingestion and processing of real-time data.

Implementation of Kafka with Node.js

Prerequisites: an intermediate level of Node.js, plus tools such as VS Code, Docker, and ZooKeeper.

Assuming that Docker is already running:

Start the ZooKeeper container on port 2181 with:


docker run -p 2181:2181 zookeeper


Next, start the Kafka container on port 9092. Here <PRIVATE_IP> is a placeholder for your machine's local IP address; the broker must advertise an address that both the containers and your Node.js app can reach:


docker run -p 9092:9092 \
-e KAFKA_ZOOKEEPER_CONNECT=<PRIVATE_IP>:2181 \
-e KAFKA_ADVERTISED_LISTENERS=PLAINTEXT://<PRIVATE_IP>:9092 \
-e KAFKA_OFFSETS_TOPIC_REPLICATION_FACTOR=1 \
confluentinc/cp-kafka
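
You can confirm that both containers are up with docker ps before moving on.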


Code:

Make a new directory, run npm init in the terminal, and install the kafkajs client library with npm install kafkajs.

Create four files: client.js, admin.js, producer.js, and consumer.js.

client.js


const { Kafka } = require("kafkajs");

// Shared Kafka client used by the admin, producer, and consumer scripts.
// Replace <PRIVATE_IP> with the same IP used in the docker run command above.
exports.kafka = new Kafka({
  clientId: "my-app",
  brokers: ["<PRIVATE_IP>:9092"],
});

admin.js


const { kafka } = require("./client");

// Creates the "rider-updates" topic with two partitions
// (one for "north" riders, one for everyone else).
async function init() {
  const admin = kafka.admin();
  console.log("Admin connecting...");
  await admin.connect(); // must be awaited before issuing admin commands
  console.log("Admin Connection Success...");

  console.log("Creating Topic [rider-updates]");
  await admin.createTopics({
    topics: [
      {
        topic: "rider-updates",
        numPartitions: 2,
      },
    ],
  });
  console.log("Topic Created Success [rider-updates]");
  console.log("Disconnecting Admin..");
  await admin.disconnect();
}

init();
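
Run this script once, before starting any producers or consumers, so that the topic exists:


node admin.js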

producer.js


const { kafka } = require("./client");
const readline = require("readline");

// Read "riderName location" pairs from stdin and publish them.
const rl = readline.createInterface({
  input: process.stdin,
  output: process.stdout,
});

async function init() {
  const producer = kafka.producer();
  console.log("Connecting Producer");
  await producer.connect();
  console.log("Producer Connected Successfully");

  rl.setPrompt("> ");
  rl.prompt();

  rl.on("line", async function (line) {
    const [riderName, location] = line.split(" ");
    await producer.send({
      topic: "rider-updates",
      messages: [
        {
          // Route "north" riders to partition 0, everyone else to partition 1.
          partition: location.toLowerCase() === "north" ? 0 : 1,
          key: "location-update",
          value: JSON.stringify({ name: riderName, location }),
        },
      ],
    });
  }).on("close", async () => {
    await producer.disconnect();
  });
}

init();


consumer.js


const { kafka } = require("./client");

// The consumer group id is passed on the command line: node consumer.js <group>
const group = process.argv[2];

async function init() {
  const consumer = kafka.consumer({ groupId: group });
  await consumer.connect();
  await consumer.subscribe({ topics: ["rider-updates"], fromBeginning: true });
  await consumer.run({
    // Called once per message; partitions are shared among consumers in the same group.
    eachMessage: async ({ topic, partition, message }) => {
      console.log(
        `${group}: [${topic}]: PART:${partition}:`,
        message.value.toString()
      );
    },
  });
}

init();

Running Locally

Start one or more consumers first, then run the producer. Consumers that share a group name split the topic's partitions between them; consumers in different groups each receive every message.


node consumer.js "group_name"


node producer.js
> deepanshu south
> dawinder north
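
Given the partition rule in producer.js, dawinder's update (location "north") should land on partition 0 and deepanshu's on partition 1, and each consumer logs the messages for the partitions it owns.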

Kafka's ability to handle high-volume data streams and ensure data durability makes it a powerful tool for real-time processing and event-driven architectures. As the importance of fast, real-time responses continues to grow, Kafka is likely to remain a prominent player. If you're looking to implement Kafka in your own projects or have any questions about working with it, feel free to reach out to me!