Fast Block Transforms on Large Binary Datasets in the Cloud Using Hadoop Streaming
TBMG-22878
09/01/2015
- Content
A software framework built on top of Hadoop Streaming enables both the processing of binary data in the cloud and the freedom for developers to implement their mapper and reducer programs in any language, rather than re-implementing existing solutions in Java or repackaging existing binary data into a text format. Binary data is partitioned into chunks that are kept in a persistent data storage medium. A textual list of filenames for these chunks is piped into a Hadoop Streaming mapper program, which then reads the corresponding files, computes block transforms locally, and writes the results back to persistent data storage. The mapper program is stored on all compute nodes, and the filenames are distributed in parallel across the cluster, so the workload is evenly balanced and the end-to-end block transform speedup scales roughly with the number of nodes in the cluster.
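The mapper protocol described above can be sketched in Python. This is a minimal, hypothetical illustration, not the framework's actual code: the byte-inversion transform stands in for whatever block transform the application computes, and the `TRANSFORM_OUT` environment variable and `process_chunk` helper are assumptions made for this example. The key point is that Hadoop Streaming delivers only a short text line (a chunk filename) to the mapper's stdin, while the binary payload is read directly from persistent storage.

```python
import os
import sys


def block_transform(block: bytes) -> bytes:
    # Placeholder transform: invert every byte. A real mapper would
    # compute the application's block transform (e.g., an FFT or DCT)
    # on each fixed-size block here.
    return bytes(255 - b for b in block)


def process_chunk(in_path: str, out_dir: str, block_size: int = 4096) -> str:
    """Read one binary chunk from persistent storage, transform it
    block by block, and write the result back alongside it."""
    out_path = os.path.join(out_dir, os.path.basename(in_path) + ".out")
    with open(in_path, "rb") as fin, open(out_path, "wb") as fout:
        while True:
            block = fin.read(block_size)
            if not block:
                break
            fout.write(block_transform(block))
    return out_path


def main() -> None:
    # Hadoop Streaming pipes the textual list of chunk filenames to
    # the mapper on stdin, one filename per line; the binary data
    # itself never passes through the streaming text interface.
    out_dir = os.environ.get("TRANSFORM_OUT", ".")
    for line in sys.stdin:
        path = line.strip()
        if path:
            out_path = process_chunk(path, out_dir)
            # Emit a tab-separated key/value record so the job's
            # text output records which chunks were processed.
            print(f"{path}\t{out_path}")


if __name__ == "__main__":
    main()
```

Because each filename is an independent unit of work, Hadoop can split the filename list across all nodes, which is what makes the near-linear speedup described above possible.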
- Citation
- "Fast Block Transforms on Large Binary Datasets in the Cloud Using Hadoop Streaming," Mobility Engineering, September 1, 2015.