ltbringer/blog

Ch-10 Batch Processing With Unix Tools Part 2

Reducing side-joins and grouping

In a relational database setting data is usually normalized for faster queries. It is common for a record to have references to other records. Coalescing such data is usually achieved by the means of a JOIN operation. Another alternative is to denormalize data.

It can be expected that a database query is indexed, reducing the time it takes to fetch the records. MapReduce jobs don't have a concept of indexing. This leads to a full table-scan a very costly operation.

We see a common pattern in data processing:

entity id log
user 105 clicked button
user 296 viewed profile of 134
user 105 login via email campaign
user 101 changed settings
user 800 searched for keyword
... ... ...
user_id email date-of-birth
101 beth@foo.com 1991-05-12
105 opa@kor.ea 1998-10-11
296 singh@bar.co.in 1995-03-23
800 potato@onion.gg 1971-08-07
... ... ...

To aggregate records we would have to perform a join on the user_id field. Joining one by one would reduce the throughput, cache efficiency would depend on data distribution, running a large number of these queries in parallel would overwhelm the database.

  • To achieve a good throughput the computation should be local to a machine.
  • We don't want to query a remote database since the results would be non-deterministic in that case.
  • We copy the database (a backup).
  • Load it in a distributed file system like HDFS.
  • Use MapReduce to aggregate the data.