Batch Layer (Lambda Architecture)

The master dataset in the Lambda Architecture is raw and immutable

Storage

Trade-off between storing data in normalized form and denormalized form

If data is stored in normalized form, querying will be slower, because related records have to be joined at query time

Inside the master dataset, data should be in normalized form. In batch views, data can be in denormalized form. Because we query the batch views and not the master dataset, queries will be faster.
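A minimal sketch of this split, assuming hypothetical record shapes (the `users` and `pageviews` structures and the `build_batch_view` function are illustrative, not from any library): the master dataset keeps each fact once, and the join is paid once at batch time so that queries on the view need no join.

```python
from collections import Counter

# Normalized master dataset (hypothetical shapes): each fact is stored once.
users = {"u1": {"country": "IN"}, "u2": {"country": "US"}}
pageviews = [
    {"user_id": "u1", "url": "/home"},
    {"user_id": "u2", "url": "/home"},
    {"user_id": "u1", "url": "/about"},
]

def build_batch_view(users, pageviews):
    """Join pageviews with users once, producing a denormalized count table."""
    view = Counter()
    for pv in pageviews:
        country = users[pv["user_id"]]["country"]
        view[(pv["url"], country)] += 1
    return dict(view)

# Denormalized batch view: (url, country) -> pageview count, queried by key lookup.
batch_view = build_batch_view(users, pageviews)
```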

Data transfer format

Data can be serialized with a serialization framework so that it can be transferred from one place to another

Ex : Apache Thrift, Protocol Buffers
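To illustrate the idea of a schema-driven binary encoding without depending on either framework, here is a rough sketch using only Python's standard `struct` module. This is not Thrift or Protocol Buffers; the record layout and the `serialize` / `deserialize` names are assumptions made for the example.

```python
import struct

# Hypothetical fixed schema for one pageview record (little-endian, no padding):
#   timestamp: unsigned 64-bit, user_id: unsigned 32-bit, url length: unsigned 16-bit,
#   followed by the UTF-8 bytes of the URL.
_HEADER = struct.Struct("<QIH")

def serialize(timestamp, user_id, url):
    """Encode one record into bytes according to the schema above."""
    url_bytes = url.encode("utf-8")
    return _HEADER.pack(timestamp, user_id, len(url_bytes)) + url_bytes

def deserialize(payload):
    """Decode bytes produced by serialize() back into (timestamp, user_id, url)."""
    timestamp, user_id, url_len = _HEADER.unpack_from(payload, 0)
    start = _HEADER.size
    url = payload[start:start + url_len].decode("utf-8")
    return timestamp, user_id, url
```

Both sides must agree on the schema, which is the role the interface definition files play in Thrift and Protocol Buffers.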

Data storage format

Distributed KV store
Distributed file systems

In our web analytics case, a distributed file system is the better choice

Data should be partitioned so that processing a particular subset of the data does not require reading the entire dataset. This process is called vertical partitioning.

In our case we can partition the data by day or by hour

Ex : HDFS
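A sketch of hourly partitioning, assuming a hypothetical folder layout of `<root>/<YYYY-MM-DD>/<HH>` (the root path and both function names are made up for illustration). A job over a time range then lists only the folders that overlap the range.

```python
from datetime import datetime, timedelta

def partition_path(ts, root="/data/pageviews"):
    """Folder holding the records for the hour that contains ts."""
    return f"{root}/{ts:%Y-%m-%d}/{ts:%H}"

def partitions_for_range(start, end, root="/data/pageviews"):
    """Hourly folders overlapping [start, end); nothing else needs to be read."""
    hour = start.replace(minute=0, second=0, microsecond=0)
    paths = []
    while hour < end:
        paths.append(partition_path(hour, root))
        hour += timedelta(hours=1)
    return paths
```

For example, a range from 22:30 on one day to 01:00 on the next touches exactly three hourly folders, regardless of how large the full dataset is.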

Computing batch view from master data

Because your master dataset is continually growing, you must have a strategy for updating your batch views when new data becomes available. You could choose a recomputation algorithm, throwing away the old batch views and recomputing the functions over the entire master dataset, but this may not always be practical.

An incremental algorithm instead updates the views directly when new data arrives.

Ex : Map-Reduce Paradigm
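The two strategies can be contrasted with a small sketch for a per-URL count view. The record shape (`{"url": ...}`) and the function names are assumptions for illustration; a real batch layer would run the equivalent logic as a distributed job.

```python
from collections import Counter

def recompute_view(master_dataset):
    """Recomputation: ignore any old view and count over the entire master dataset."""
    return Counter(record["url"] for record in master_dataset)

def update_view(view, new_records):
    """Incremental: fold only the newly arrived records into the existing view."""
    updated = Counter(view)
    updated.update(record["url"] for record in new_records)
    return updated
```

For a simple count both approaches give the same result; the incremental version reads only the new records, while the recomputation version is simpler and rebuilds the view from scratch on every run.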

Batch views for web analytics queries

Page views over time range

A view can be generated that stores, for every URL, the page view count for each hour

To optimize queries over longer time ranges, aggregated counts can also be stored per day/month/year
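A sketch of the hourly view and a daily roll-up, assuming a hypothetical key layout of `(url, day, hour)` for the hourly view and `(url, day)` for the daily one. The sample counts are invented for the example.

```python
from collections import defaultdict

# Hypothetical hourly batch view: (url, "YYYY-MM-DD", hour) -> pageview count.
hourly = {
    ("/home", "2024-01-15", 22): 5,
    ("/home", "2024-01-15", 23): 7,
    ("/home", "2024-01-16", 0): 3,
    ("/about", "2024-01-15", 23): 2,
}

def roll_up_to_days(hourly_view):
    """Pre-aggregate hourly counts into one count per (url, day)."""
    daily = defaultdict(int)
    for (url, day, _hour), count in hourly_view.items():
        daily[(url, day)] += count
    return dict(daily)

def pageviews(view, url, buckets):
    """Sum the stored counts for the given buckets; missing buckets count as zero."""
    return sum(view.get((url, *bucket), 0) for bucket in buckets)

daily = roll_up_to_days(hourly)
```

A query spanning many days can then read one daily bucket per day instead of 24 hourly buckets, and the same idea extends to month and year granularities.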

Unique visitors over time range

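Unique visitors over a time range cannot be computed exactly unless you fetch all the data for that range and deduplicate it, because the same visitor can appear in several hours and hourly unique counts cannot simply be added

If some error is acceptable, we can generate a HyperLogLog sketch for every hour and merge the hourly sketches to estimate the unique count over the range

To optimize queries over longer time ranges, merged HyperLogLog sketches can also be stored per day/month/year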
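The following is a simplified HyperLogLog sketch in plain Python, meant only to show why hourly sketches can be merged where hourly unique counts cannot be added. The register count (`P = 12`), the use of SHA-1 as the hash, and all function names are assumptions for the example; a production system would more likely use an existing HyperLogLog implementation.

```python
import hashlib
import math

P = 12                              # index bits (assumed); M = 2**P registers
M = 1 << P
ALPHA = 0.7213 / (1 + 1.079 / M)    # bias-correction constant, valid for M >= 128

def _hash64(value):
    """64-bit hash of a string, taken from the first 8 bytes of SHA-1."""
    digest = hashlib.sha1(value.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")

def new_sketch():
    return [0] * M

def add(sketch, value):
    h = _hash64(value)
    index = h >> (64 - P)                     # top P bits select a register
    rest = h & ((1 << (64 - P)) - 1)          # remaining 64 - P bits
    rank = (64 - P) - rest.bit_length() + 1   # position of the leftmost 1 bit
    sketch[index] = max(sketch[index], rank)

def merge(a, b):
    """Sketch of the union of two sets: register-wise maximum."""
    return [max(x, y) for x, y in zip(a, b)]

def estimate(sketch):
    raw = ALPHA * M * M / sum(2.0 ** -r for r in sketch)
    zeros = sketch.count(0)
    if raw <= 2.5 * M and zeros:
        return M * math.log(M / zeros)        # small-range (linear counting) correction
    return raw
```

Merging the sketches for two hours gives exactly the sketch that would have been built from both hours' visitors together, so a visitor seen in both hours is counted once. Day, month, or year sketches are built the same way by merging the finer-grained ones.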

Bounce rate analysis

As with page views, we only need to maintain a bounce count and a total visit count for each time bucket
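A sketch of such a view, assuming a hypothetical key of `(domain, hour bucket)` and invented sample numbers. Storing the two counts, and not the rate itself, is what allows buckets to be combined correctly over a longer range.

```python
# Hypothetical hourly batch view: (domain, hour bucket) -> (bounced visits, total visits).
view = {
    ("example.com", "2024-01-15T09"): (30, 100),
    ("example.com", "2024-01-15T10"): (10, 300),
}

def bounce_rate(view, domain, hours):
    """Bounce rate over several hour buckets: sum both counts, then divide once."""
    bounces = total = 0
    for hour in hours:
        b, t = view.get((domain, hour), (0, 0))
        bounces += b
        total += t
    return bounces / total if total else 0.0
```

Averaging the two hourly rates here would give a different and misleading number, since the second hour has three times as many visits as the first.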
