How did Facebook evolve the Hive storage format to cope with a data warehouse that stores some 300 petabytes and takes in about 600 terabytes per day? RCFile (Record-Columnar File format) wasn’t enough, so enter ORCFile.
Vagata and Wilfong described the motivation behind ORCFile:
There are many areas we are innovating in to improve storage efficiency for the warehouse — building cold storage data centers, adopting techniques like RAID in HDFS to reduce replication ratios (while maintaining high availability), and using compression for data reduction before it’s written to HDFS. The most widely used system at Facebook for large data transformations on raw logs is Hive, a query engine based on Corona Map-Reduce, used for processing and creating large tables in our data warehouse. In this post, we will focus primarily on how we evolved the Hive storage format to compress raw data as efficiently as possible into the on-disk data format.
Data that is loaded into tables in the warehouse is primarily stored using a storage format originally developed at Facebook: Record-Columnar File Format (RCFile). RCFile is a hybrid columnar storage format that is designed to provide the compression efficiency of a columnar storage format, while still allowing for row-based query processing. The core idea is to partition Hive table data first horizontally into row groups, then vertically by columns so that the columns are written out one after the other as contiguous chunks.
As the volume of data stored in our warehouse continued to grow, engineers on the team began investigating techniques to improve compression efficiency. The investigations focused on column-level encodings such as run-length encoding, dictionary encoding, frame-of-reference encoding, and better numeric encodings to reduce logical redundancies at the column level prior to running it through a codec. We also experimented with new column types (for example: JSON is used very heavily across Facebook, and storing JSON in a structured fashion allows for efficient queries, as well as removing common JSON metadata across column values). Our experiments indicated that column-specific encodings (when applied judiciously) could provide significant improvements in compression over RCFile.
Around the same time, Hortonworks had also begun investigating similar ideas for an improved storage format for Hive. The Hortonworks engineering team designed and implemented ORCFile’s on-disk representation and reader/writer. This provided us a great starting point for a new storage format for the Facebook data warehouse.
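The hybrid layout described above can be sketched in a few lines. This is an illustrative model only, not RCFile's actual on-disk format: the table is first split horizontally into row groups, and each group is then transposed so that every column's values land in one contiguous chunk. The `ROW_GROUP_SIZE` constant and the list-of-lists representation are assumptions chosen for readability; real row groups are far larger and carry per-chunk metadata.

```python
# Sketch of a hybrid row-group/columnar layout in the spirit of RCFile:
# first partition rows horizontally into groups, then store each group
# column by column as contiguous chunks.

ROW_GROUP_SIZE = 3  # rows per group; purely illustrative

def to_row_groups(rows, group_size=ROW_GROUP_SIZE):
    """Split a table (list of row tuples) into column-oriented row groups."""
    groups = []
    for start in range(0, len(rows), group_size):
        group = rows[start:start + group_size]
        # Transpose the group: one contiguous list per column.
        columns = [list(col) for col in zip(*group)]
        groups.append(columns)
    return groups

table = [
    ("alice", 30), ("bob", 25), ("carol", 41),
    ("dave", 33), ("erin", 29),
]
groups = to_row_groups(table)
# First group stores the name column chunk, then the age column chunk:
# [['alice', 'bob', 'carol'], [30, 25, 41]]
```

Keeping whole rows within one group is what preserves row-based query processing: a reader can reassemble any row from a single group, while a columnar codec still sees long runs of same-typed values.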
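Two of the column-level encodings the post mentions are easy to demonstrate. The sketch below is a toy illustration, not ORCFile's implementation: run-length encoding collapses consecutive repeats of a value, and dictionary encoding replaces recurring strings with small integer codes, shrinking the logical redundancy in a column chunk before a general-purpose codec sees it.

```python
# Toy versions of two column-level encodings: run-length encoding and
# dictionary encoding. Column data with many repeats compresses well
# under either, which is why they pay off on sorted or low-cardinality
# columns.

def run_length_encode(values):
    """Collapse consecutive repeats into (value, run_length) pairs."""
    runs = []
    for v in values:
        if runs and runs[-1][0] == v:
            runs[-1] = (v, runs[-1][1] + 1)
        else:
            runs.append((v, 1))
    return runs

def dictionary_encode(values):
    """Map each distinct value to a small integer code."""
    dictionary, codes, index = [], [], {}
    for v in values:
        if v not in index:
            index[v] = len(dictionary)
            dictionary.append(v)
        codes.append(index[v])
    return dictionary, codes

country = ["US", "US", "US", "IN", "IN", "US"]
print(run_length_encode(country))   # [('US', 3), ('IN', 2), ('US', 1)]
print(dictionary_encode(country))   # (['US', 'IN'], [0, 0, 0, 1, 1, 0])
```

The payoff compounds when encodings are stacked: the dictionary codes themselves are small integers that a run-length or bit-packing pass, and finally a codec such as zlib, can shrink further.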
Did it work? Vagata and Wilfong wrote:
By applying all of these improvements, we evolved ORCFile to provide a significant boost in compression ratios over RCFile on our warehouse data, going from five times to eight times. Additionally, on a large representative set of queries and data from our warehouse, we found that the Facebook ORCFile writer is three times better on average than open-source ORCFile.
We have rolled out this new storage format to many 10s of petabytes of warehouse data at Facebook and have reclaimed 10s of petabytes of capacity by switching from RCFile to Facebook ORCFile as the storage format. We are in the process of rolling out the format to additional tables in our data warehouse, so that we can take further advantage of the improved storage efficiency and read/write performance. We have made our storage format available at GitHub and are working with the open source community to incorporate these improvements back into the Apache Hive project.
For much more on the technical details behind ORCFile, please see the blog post by Vagata and Wilfong.