Hadoop: The Definitive Guide

2009 – Tom White

Organizations large and small are adopting Apache Hadoop to deal with huge application datasets. Hadoop: The Definitive Guide provides you with the key for unlocking the wealth this data holds. Hadoop is ideal for storing and processing massive amounts of data, but until now, information on this open source project has been lacking, especially with regard to best practices. This comprehensive resource demonstrates how to use Hadoop to build reliable, scalable, distributed systems. Programmers will find details for analyzing large datasets with Hadoop, and administrators will learn how to set up and run Hadoop clusters.

This book helps you:

- Use the Hadoop Distributed File System (HDFS) for storing large datasets, and run distributed computations over those datasets using MapReduce
- Become familiar with Hadoop’s data and I/O building blocks for compression, data integrity, serialization, and persistence
- Discover common pitfalls and advanced features for writing real-world MapReduce programs
- Use Pig, a high-level query language for large-scale data processing
- Learn ZooKeeper, a toolkit of coordination primitives for building distributed systems
- Design, build, and administer a dedicated Hadoop cluster, or run Hadoop in the cloud
- Use HBase, Hadoop’s database for structured and semi-structured data

This book includes case studies that illustrate how Hadoop is used to solve specific problems. If you’re considering Hadoop, or already use it, Hadoop: The Definitive Guide is the most thorough book available on the subject.


More generally, the digital streams that individuals are producing are growing apace. Microsoft Research’s MyLifeBits project gives a glimpse of archiving of personal information that may become commonplace in the near future. MyLifeBits was an experiment where an individual’s interactions—phone calls, emails, documents—were captured electronically and stored for later access. The data gathered included a photo taken every minute, which resulted in an overall data volume of one gigabyte a month. When storage costs come down enough to make it feasible to store continuous audio and video, the data volume for a future MyLifeBits service will be many times that.

The trend is for every individual’s data footprint to grow, but perhaps more importantly the amount of data generated by machines will be even greater than that generated by people. Machine logs, RFID readers, sensor networks, vehicle GPS traces, retail transactions—all of these contribute to the growing mountain of data.

The volume of data being made publicly available increases every year too. Organizations no longer have to merely manage their own data: success in the future will be dictated to a large extent by their ability to extract value from other organizations’ data.

. . .