
We have a Hadoop/Hive cluster of 2 servers. On each server the Hive database uses ~160GB of disk space, but the Hadoop data directory is ~850GB.

Is this normal, and what is the typical ratio between Hive database size and Hadoop data directory size?

thebluephantom
Napas
  • Have you created internal tables or external tables in Hive? Please refer to http://stackoverflow.com/questions/17038414/difference-between-hive-internal-tables-and-external-tables if the difference is unclear. – Abhishek Pathak Sep 29 '14 at 05:43

2 Answers

2

This depends entirely on what data you are storing. The data in your Hive databases is itself part of the Hadoop data directory. If you store only Hive table data in Hadoop, the ratio would be 1:1.

There is no fixed relation between Hive database size and Hadoop data directory size. HDFS is the superset in which all data, including the Hive databases, is stored.
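
To make the superset point concrete, here is a minimal arithmetic sketch using the figures from the question. It assumes both numbers were measured the same way on one server; the function name `hive_share` is made up for illustration.

```python
from typing import Tuple


def hive_share(hive_gb: float, data_dir_gb: float) -> Tuple[float, float]:
    """Return (fraction of the data directory used by Hive, GB used by everything else)."""
    if data_dir_gb <= 0:
        raise ValueError("data_dir_gb must be positive")
    return hive_gb / data_dir_gb, data_dir_gb - hive_gb


# Figures from the question: ~160 GB of Hive data, ~850 GB Hadoop data directory.
fraction, other_gb = hive_share(160, 850)
print(f"Hive: {fraction:.0%} of the data directory, everything else: {other_gb:.0f} GB")
# prints "Hive: 19% of the data directory, everything else: 690 GB"
```

So under these assumptions roughly 690 GB per server belongs to something other than the Hive database, and that is the part worth investigating.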

Amar
  • In Mindaugus' case /dfs/dn/ takes up the most space, and the .db file is only 150GB. Can we make /dfs/dn smaller? Is the same information stored there? – Ploetzeneder Sep 29 '14 at 05:43
2

/dfs/dn is the datanode directory, so its size reflects the data held in HDFS on that node. This includes the space occupied by Hive tables as well as everything else stored in HDFS.
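
If it helps to see where the space under the datanode directory goes, a rough sketch like the following totals file sizes per immediate subdirectory. The helper names `dir_size_bytes` and `size_by_child` are hypothetical, and the path `/dfs/dn` is taken from this thread. It sums file lengths, so the totals can differ slightly from what `du` reports.

```python
import os


def dir_size_bytes(path):
    """Total size in bytes of all regular files under path (symlinks are skipped)."""
    total = 0
    for root, _dirs, files in os.walk(path):
        for name in files:
            full = os.path.join(root, name)
            if os.path.islink(full):
                continue
            try:
                total += os.path.getsize(full)
            except OSError:
                pass  # file disappeared or is unreadable; ignore it
    return total


def size_by_child(path):
    """Map each immediate subdirectory of path to its total size, largest first."""
    sizes = {
        entry.name: dir_size_bytes(entry.path)
        for entry in os.scandir(path)
        if entry.is_dir(follow_symlinks=False)
    }
    return dict(sorted(sizes.items(), key=lambda kv: kv[1], reverse=True))


# Example usage on a datanode host (path assumed from the thread):
# for name, size in size_by_child("/dfs/dn").items():
#     print(f"{size / 1e9:8.1f} GB  {name}")
```

This only inspects the local filesystem; it does not tell you which HDFS paths the blocks belong to.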

If you are using Hadoop only to store Hive data, consider creating external tables. For these, Hive stores only the metadata and reads the data already sitting in your HDFS folders, in contrast to an internal table, for which Hive manages the data itself as well as the metadata and keeps the files under its own warehouse directory.

Abhishek Pathak
  • Is there any effect on speed when using external tables? – Ploetzeneder Sep 29 '14 at 07:47
  • There is no speed difference. Hive just acts as a framework to run map-reduce on flat, structured data, irrespective of where it is stored on HDFS. For an internal table, Hive takes the data and puts it in a location of its choice for subsequent processing. For an external table, you explicitly tell Hive where to look for the data. – Abhishek Pathak Sep 29 '14 at 08:41
  • OK, if I alter the table, does it free up /dfs/dn? – Ploetzeneder Sep 29 '14 at 10:22
  • If you have internal tables, dropping them should free up /dfs/dn over time. Can you run a "show create table " in Hive and share the output? – Abhishek Pathak Sep 29 '14 at 10:45
  • Yes I can, here: http://pastebin.com/KBY4nUj2 So how do I get this into a smaller table and still join it into a new table? – Ploetzeneder Sep 29 '14 at 21:34