0%

What-is-Kafka

What can Kafka do?

  • Kafka as a Messaging System
    • Combination of two traditional queuing systems
      • queuing
        Scalable but not extendable. Data is gone after consumed
      • publish-subscribe
        Not scalable since every subscriber will receive a copy of new messages
    • Provide string order guarantee in a partition
      YES, no global ordering across multiple partitions in a topic
  • Kafka as a Storage System
    • Data is written on disk and replicated for fault-tolerance
    • Provide write on acknowledgement
    • Perform the same whether the data is 50KB or 50TB
  • Kafka for a Stream Processing
    • Handling out-of-order data
      • Use event time rather than processing time to provide a deterministic result
    • Performing stateful computations
      • uses Kafka for stateful storage
    • Reprocessing input as code changes

How does Kafka provide a High Available system?

The total number of replicas including the leader constitute the replication factor.
All reads and writes go to the leader of the partition.

Kafka dynamically maintains a set of in-sync replicas (ISR) that are caught-up to the leader. Only members of this set are eligible for election as leader. A write to a Kafka partition is not considered committed until all in-sync replicas have received the write.

When writing to Kafka, producers can choose whether they wait for the message to be acknowledged by 0,1 or all (-1) replicas.

Key properties

property Scope usage
replication.factor broker, topic The total number of replicas including the leader
min.insync.replicas broker, topic the minimum number of replicas that must acknowledge a write for the write to be considered successful
acks producer The number of acknowledgments the producer requires the leader to have received before considering a request complete

How does Kafka scale?

  • Partitioning the log
    • Multiple messages can be read/written concurrently using multiple partition on different broker
  • Optimizing throughput by batching reads and writes
    • Batching leads to larger network packets, larger sequential disk operations, contiguous memory blocks, and so on, all of which allows Kafka to turn a bursty stream of random message writes into linear writes that flow to the consumers.
  • Avoiding needless data copies
    • zero-copy data transfer which directly send data from disk to socket rather than moving data between pagecache, application buffer and socket buffer

The cumulative effect of these optimizations is that you can usually write and read data at the rate supported by the disk or network, even while maintaining data sets that vastly exceed memory.

What is Stream Processing and Why

I see stream processing as something much broader: infrastructure for continuous data processing. I think the computational model can be as general as MapReduce or other distributed processing frameworks, but with the ability to produce low-latency results.

  • The real driver for the processing model is the method of data collection
  • A event driven based Stream Process system can produce low-latency result

Throughput

  • 2015/03
    At the busiest times of day, we are receiving over 13 million messages per second, or 2.75 gigabytes of data per second. To handle all these messages, LinkedIn runs over 1100 Kafka brokers organized into more than 60 clusters.

References

https://kafka.apache.org/documentation/

The Log: What every software engineer should know about real-time data’s unifying abstraction

Developing a Deeper Understanding of Apache Kafka Architecture

Developing a Deeper Understanding of Apache Kafka Architecture Part 2: Write and Read Scalability

Intro to Streams | Apache Kafka® Streams API