
I couldn't find any documented limitations on the Hive wiki (https://cwiki.apache.org/confluence/display/Hive/Home).

My guess is that there is no limit on the number of rows or columns, and that file size is limited by the file system. By partitioning the data properly, we can also manage the file sizes and the number of files.

Thank you.

Joshua G

1 Answer


Number of columns.

In this JIRA, HIVE-7250 (https://issues.apache.org/jira/browse/HIVE-7250), 15K columns were tested successfully, while 20K columns caused an OOM for ORC files with the default 1 GB heap. Text files can probably store even more columns. The JIRA has since been fixed, by the way.
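If you want to probe that limit on your own cluster, one simple approach is to generate a very wide CREATE TABLE statement and feed it to beeline. Here is a minimal Python sketch; the table name, column prefix and column count are arbitrary placeholders, not anything from the JIRA:

    # Generate a CREATE TABLE statement with many columns to probe the
    # column-count limit on your own cluster. Names and counts are arbitrary.
    def wide_table_ddl(table="wide_test", n_cols=15000, stored_as="ORC"):
        cols = ",\n  ".join("col_{} STRING".format(i) for i in range(n_cols))
        return "CREATE TABLE {} (\n  {}\n) STORED AS {};".format(table, cols, stored_as)

    # Write the DDL to a file and run it, e.g.: beeline -f wide_table.sql
    with open("wide_table.sql", "w") as f:
        f.write(wide_table_ddl())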

Max file size.

Files are stored split into blocks, and a block ID is a long, so there can be at most 2^63 blocks. With a 64 MB block size, that gives a theoretical maximum of 512 yottabytes. So there is practically no limit on file size, but there are other Hadoop limits.
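To make the arithmetic explicit, here is the back-of-the-envelope calculation behind the 512-yottabyte figure (plain Python, nothing Hadoop-specific):

    # Block IDs are 64-bit longs, so at most 2**63 distinct blocks;
    # multiply by the block size to get the theoretical ceiling.
    BLOCK_SIZE = 64 * 1024**2           # 64 MB in bytes
    MAX_BLOCKS = 2**63                  # long block IDs
    YOTTABYTE  = 1024**8                # 2**80 bytes

    theoretical_max = MAX_BLOCKS * BLOCK_SIZE
    print(theoretical_max / YOTTABYTE)  # -> 512.0 yottabytes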

The question is too broad for a full answer, but there are a few important conclusions about Hadoop scalability in this work: http://c59951.r51.cf2.rackcdn.com/5424-1908-shvachko.pdf

Namespace limitation.

The namespace consists of files and directories. Directories define the hierarchical structure of the namespace. Files—the data containers—are divided into large (128MB each) blocks.

The name-node's metadata consists of the hierarchical namespace and a block-to-data-node mapping, which determines physical block locations. In order to keep the rate of metadata operations high, HDFS keeps the whole namespace in RAM. The name-node persistently stores the namespace image and its modification log in external memory such as a local or a remote hard drive. The namespace image and the journal contain the HDFS file and directory names and their attributes (modification and access times, permissions, quotas), including block IDs for files. In order to store 100 million files (referencing 200 million blocks), a name-node should have at least 60 GB of RAM.
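As a rough planning aid, that figure can be turned into a simple estimator. The bytes-per-object constant below is only calibrated from the paper's own numbers (60 GB for 100 million files plus 200 million blocks); the real per-object memory depends on the Hadoop version, file name lengths and so on.

    # Rough name-node heap estimator, calibrated from the paper's figure:
    # 60 GB of RAM for 100M files referencing 200M blocks (300M objects total).
    BYTES_PER_OBJECT = 60 * 1024**3 / 300e6   # ~215 bytes per file or block

    def namenode_ram_gb(num_files, num_blocks):
        return (num_files + num_blocks) * BYTES_PER_OBJECT / 1024**3

    print(namenode_ram_gb(100e6, 200e6))   # -> 60.0 GB, by construction
    print(namenode_ram_gb(500e6, 1000e6))  # -> ~300 GB for 5x the namespace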

Disk space.

With 100 million files each having an average of 1.5 blocks, we will have 200 million blocks in the file system. If the maximal block size is 128MB and every block is replicated three times, then the total disk space required to store these blocks is close to 60PB.

Cluster size.

In order to accommodate data referenced by a 100 million file namespace, an HDFS cluster needs 10,000 nodes equipped with eight 1 TB hard drives. The total storage capacity of such a cluster is 60 PB.
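The disk-space and cluster-size figures follow from the same 200 million blocks. Note that 200M blocks at the full 128 MB with 3x replication would be roughly 77 PB; the paper's ~60 PB works out if blocks are on average only about 100 MB full, which is an assumption I am adding here to reconcile the numbers. A sketch of that arithmetic:

    # Disk-space and cluster-size arithmetic from the paper's assumptions.
    NUM_BLOCKS      = 200e6      # blocks referenced by 100M files
    AVG_BLOCK_BYTES = 100 * 1e6  # ~100 MB average fill (assumption: blocks are rarely full)
    REPLICATION     = 3
    PB = 1e15

    raw_storage = NUM_BLOCKS * AVG_BLOCK_BYTES * REPLICATION
    print(raw_storage / PB)      # -> ~60 PB of physical disk space

    node_capacity = 8 * 1e12     # eight 1 TB drives per data-node, as in the paper
    print(raw_storage / node_capacity)  # -> 7500 nodes just for raw capacity;
                                        # the paper sizes the cluster at 10,000 nodes,
                                        # which leaves headroom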

Internal load.

The internal load for block reports and heartbeat processing on a 10,000-node HDFS cluster with a total storage capacity of 60 PB will consume 30% of the total name-node processing capacity.

UPDATE:

All of the above applies to native HDFS in Hadoop 2.

Amazon S3 is claimed to be much more scalable, virtually unlimited, though S3 is eventually consistent for reads after overwrites and deletes. HADOOP-13345 adds an optional feature to the S3A client for Amazon S3 storage: the ability to use a DynamoDB table as a fast and consistent store of file and directory metadata.

Also there are other Hadoop Compatible FileSystems (HCFS).

Also, with support for erasure coding in Hadoop 3.0, physical disk usage is cut roughly in half (i.e. 3x disk space consumption drops to 1.5x) while the fault tolerance level increases by 50%. This Hadoop 3.0 feature can save Hadoop customers big bucks on hardware infrastructure: they can halve the size of their cluster and store the same amount of data, or keep the current cluster hardware and store double the amount of data with HDFS EC. Read more about HDFS Erasure Coding and other Hadoop 3 HDFS enhancements.
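To see where the "half the disk space, 50% better fault tolerance" claim comes from, compare 3x replication with a Reed-Solomon layout such as RS(6,3), which is the kind of policy Hadoop 3 ships. A minimal sketch; the exact policy available on your cluster may differ:

    # Storage overhead and fault tolerance: 3x replication vs. RS(6,3) erasure coding.
    def replication(factor=3):
        # `factor` copies of every block; survives (factor - 1) lost copies
        return {"overhead": float(factor), "failures_tolerated": factor - 1}

    def reed_solomon(data_units=6, parity_units=3):
        # data + parity cells per stripe; any `parity_units` cells may be lost
        overhead = (data_units + parity_units) / data_units
        return {"overhead": overhead, "failures_tolerated": parity_units}

    print(replication())    # {'overhead': 3.0, 'failures_tolerated': 2}
    print(reed_solomon())   # {'overhead': 1.5, 'failures_tolerated': 3}
    # 3.0x -> 1.5x disk usage (half), and 2 -> 3 tolerated failures (+50%)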

leftjoin