Fast Block Transforms on Large Binary Datasets in the Cloud Using Hadoop Streaming

TBMG-22878

09/01/2015

Abstract

A software framework built on top of Hadoop Streaming enables the processing of binary data in the cloud while giving developers the freedom to implement their mapper and reducer programs in any language, rather than re-implementing existing solutions in Java or repackaging existing binary data into a text format. The binary data is partitioned into chunks that are kept in a persistent data storage medium. A textual list of the chunk filenames is piped into a Hadoop Streaming mapper program, which reads the corresponding files, computes block transforms locally, and writes the results back to persistent data storage. The mapper program is stored on all compute nodes, and the filenames are distributed in parallel across the cluster, so the workload is evenly balanced and the end-to-end block-transform speedup is roughly equal to the number of nodes in the cluster.
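The workflow described above can be sketched as a minimal streaming mapper. This is a hypothetical illustration, not the article's actual implementation: the block size, the output naming scheme (`<input>.out`), and the placeholder transform (byte reversal) are all assumptions standing in for the application's real block transform and storage layout.

```python
import sys

BLOCK_SIZE = 4096  # hypothetical block size; the article does not specify one


def block_transform(block: bytes) -> bytes:
    # Placeholder transform (byte reversal); a real mapper would apply
    # the application's actual block transform here.
    return block[::-1]


def process_chunk(in_path: str, out_path: str) -> None:
    # Read one binary chunk from persistent storage, transform it
    # block by block, and write the result back.
    with open(in_path, "rb") as src, open(out_path, "wb") as dst:
        while True:
            block = src.read(BLOCK_SIZE)
            if not block:
                break
            dst.write(block_transform(block))


def main() -> None:
    # Hadoop Streaming pipes each mapper's share of the textual filename
    # list to stdin, one filename per line; the chunks themselves never
    # pass through Hadoop's text channel.
    for line in sys.stdin:
        in_path = line.strip()
        if not in_path:
            continue
        out_path = in_path + ".out"  # assumed output naming convention
        process_chunk(in_path, out_path)
        # Emit the output filename so downstream stages can track results.
        print(out_path)


if __name__ == "__main__":
    main()
```

Because Hadoop Streaming only sees the short text lines of filenames, splitting the input list across mappers is cheap, while each node does the heavy binary I/O and computation against persistent storage locally, which is what makes the speedup scale roughly with the node count.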

Citation
"Fast Block Transforms on Large Binary Datasets in the Cloud Using Hadoop Streaming," Mobility Engineering, September 1, 2015.
Additional Details
Published
Sep 1, 2015
Product Code
TBMG-22878
Content Type
Magazine Article
Language
English