Ch-10 Batch Processing With Unix Tools Part 2
Reduce-side joins and grouping
In a relational database, data is usually normalized: it is common for a record to hold references to other records. Bringing such data back together is usually done with a JOIN operation; the alternative is to denormalize the data.
A database query can be expected to use an index, reducing the time it takes to fetch the matching records. MapReduce jobs have no concept of an index: a job reads the entire content of its input files, the equivalent of a full table scan, which is very costly if you only need a few records.
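The difference can be sketched in plain Python; the event records and the `mapper` function are illustrative, standing in for a job's input files and map task:

```python
# In a database, an index lets the query touch only matching rows:
#   SELECT * FROM user_events WHERE user_id = 105;  -- indexed lookup
# A MapReduce job has no index: the mapper is called once per input
# record, so even a highly selective filter costs a full scan.

# Hypothetical event records standing in for the job's input files.
records = [
    {"user_id": 105, "event": "clicked button"},
    {"user_id": 296, "event": "viewed profile"},
    {"user_id": 105, "event": "login via email campaign"},
]

def mapper(record):
    # The filter is applied while scanning every record.
    if record["user_id"] == 105:
        yield record

matches = [out for rec in records for out in mapper(rec)]
# All three records were read; only two matched.
```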
A common pattern in data processing: a log of user activity events, and a user database keyed by user_id.
| entity | user_id | event |
|---|---|---|
| user | 105 | clicked button |
| user | 296 | viewed profile of 134 |
| user | 105 | login via email campaign |
| user | 101 | changed settings |
| user | 800 | searched for keyword |
| ... | ... | ... |
| user_id | email | date-of-birth |
|---|---|---|
| 101 | beth@foo.com | 1991-05-12 |
| 105 | opa@kor.ea | 1998-10-11 |
| 296 | singh@bar.co.in | 1995-03-23 |
| 800 | potato@onion.gg | 1971-08-07 |
| ... | ... | ... |
To aggregate the records we would have to join them on the user_id field. Querying the user database one record at a time would limit throughput,
cache efficiency would depend on the data distribution, and running a large number of such queries in parallel could overwhelm the database.
- To achieve good throughput, the computation should be local to each machine.
- We don't want to query a remote database over the network: the database contents may change between runs, making the job non-deterministic.
- Instead, we take a copy of the database (e.g. from a backup),
- load it into a distributed filesystem such as HDFS alongside the event log,
- and use MapReduce to bring the related records together.
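The reduce-side join above can be sketched in plain Python. The sample records and field names are illustrative, and the `shuffle` function stands in for the framework's sort-and-partition step; tagging the user record with 0 is a simple secondary sort, so it arrives at the reducer before that user's events:

```python
from itertools import groupby
from operator import itemgetter

# Hypothetical samples of the two inputs shown in the tables above.
activity_events = [
    {"user_id": 105, "event": "clicked button"},
    {"user_id": 101, "event": "changed settings"},
]
user_database = [
    {"user_id": 101, "date_of_birth": "1991-05-12"},
    {"user_id": 105, "date_of_birth": "1998-10-11"},
]

def mapper():
    # Emit (join key, tagged record) pairs from both inputs.
    # Tag 0 sorts the user record before that user's events (tag 1).
    for user in user_database:
        yield user["user_id"], (0, user["date_of_birth"])
    for event in activity_events:
        yield event["user_id"], (1, event["event"])

def shuffle(pairs):
    # Stands in for the framework's shuffle: sort by key so each
    # user's records arrive at the reducer contiguously.
    return sorted(pairs)

def reducer(pairs):
    for user_id, group in groupby(pairs, key=itemgetter(0)):
        records = [value for _, value in group]
        dob = records[0][1]  # tag 0 sorted first: the user record
        for _tag, event in records[1:]:
            yield {"user_id": user_id, "date_of_birth": dob, "event": event}

joined = list(reducer(shuffle(mapper())))
```

Because the join logic only ever sees records that share a key, each reducer needs to keep just one user's data in memory at a time, and no network requests to a database are made during the job.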