Latest Articles from Janakiram MSV

Apache Kafka: A Primer

Apache Kafka is fast becoming the preferred messaging infrastructure for dealing with contemporary, data-centric workloads such as Internet of Things, gaming, and online advertising. The ability to ingest data at a lightening speed makes it an ideal choice for building complex data processing pipelines. In a previous article, we discussed how Kafka acts as the gateway for IoT sensor data for processing hot path and cold path analytics.

In this article, we will introduce the core concepts and terminology of Apache Kafka along with the high-level architecture.

Why Kafka?

Before the introduction of Apache Kafka, Message Oriented Middleware (MOM) such as Apache QpidRabbitMQMicrosoft Message Queue, and IBM MQ Series were used for exchanging messages across various components. While these products are good at implementing the publisher/subscriber pattern (Pub/Sub), they are not specifically designed for dealing with large streams of data originating from thousands of publishers. Most of the MOM software have a broker that exposes Advanced Message Queuing Protocol (AMQP) protocol for asynchronous communication.

Kafka is designed from the ground up to deal with millions of firehose-style events generated in rapid succession. It guarantees low latency, “at-least-once”, delivery of messages to consumers. Kafka also supports retention of data for offline consumers, which means that the data can be processed either in real-time or in offline mode.

The fundamental difference between Message Oriented Middleware and Kafka is that the clients will never receive messages automatically. They have to explicitly ask for a message when they are ready to handle.

Expanding further on the persistence and retention, Kafka is designed to be a distributed commit log. Much like relational databases, it can provide a durable record of all transactions that can be played back to recover the state of a system. The key thing to understand is that the data is stored durably in an order that can be read deterministically. Due to the distributed design, Kafka provides redundancy, which ensures high availability of data even when one of the servers faces disruption.

Read the entire article at The New Stack

is an analyst, advisor, and architect. Follow him on Twitter,  Facebook and LinkedIn.

Janakiram MSVApache Kafka: A Primer

Related Posts