14

The syntax I see for computing statistics in hive seems to indicate the answer to the title question would be 'no':

ANALYZE TABLE [TABLENAME] PARTITION(parcol1=…, partcol2=….) COMPUTE STATISTICS

However, I wanted to throw it out here, since it i surprising that it were always required to write a script to iterate over the partitions to generate the per-partition statements. We have about a thousand partitions on this small table right now and it will be growing by orders of magnitude.

BTW I tried the following without specifying the partition:

hive> analyze table metrics compute statistics;
FAILED: SemanticException [Error 10115]: Table is partitioned and partition specification is needed
WestCoastProjects
  • 58,982
  • 91
  • 316
  • 560

3 Answers3

11

Yes, you can.

At least from hive v0.13 which I'm on. Just try partition spec syntax without specific values (no =… bits)

If you're using FOR COLUMNS then you can't due to the bug: https://issues.apache.org/jira/browse/HIVE-4861

msciwoj
  • 772
  • 7
  • 23
5

I am on latest Hive 1.2 and the following command works very fine

hive> analyze table member partition(day) compute statistics noscan;
Partition mobi_mysql.member{day=20150831} stats: [numFiles=7, numRows=-1, totalSize=4735943322, rawDataSize=-1]
Partition mobi_mysql.member{day=20150901} stats: [numFiles=7, numRows=117512, totalSize=19741804, rawDataSize=0]
Partition mobi_mysql.member{day=20150902} stats: [numFiles=7, numRows=-1, totalSize=17734601, rawDataSize=-1]
Partition mobi_mysql.member{day=20150903} stats: [numFiles=7, numRows=-1, totalSize=13091084, rawDataSize=-1]
OK
Time taken: 2.089 seconds
minhas23
  • 9,291
  • 3
  • 58
  • 40
  • what does the totalSize value mean? – paolov Dec 22 '15 at 22:43
  • 1
    @user55570 It is the tolal size of files in that partition. You can check it with hadoop fs -ls ${path_to_partition}. It is in bytes. numRows are misleading or may be a bug in this utility – minhas23 Dec 23 '15 at 06:02
  • Thanks. I thought that should be reflected as rawDataSize. If that is what totalSize means, then what is rawDataSize supposed to be? – paolov Dec 24 '15 at 01:08
0

According to Hive manual if you do not specify partition specs statistics are gathered for entire table, https://cwiki.apache.org/confluence/display/Hive/StatsDev

When the user issues that command, he may or may not specify the partition specs. If the user doesn't specify any partition specs, statistics are gathered for the table as well as all the partitions (if any).
SKP
  • 135
  • 4
  • Yes, I see that, but there is also this comment in same section: "When computing statistics across all partitions, the partition columns still need to be listed." So I would suggest including the partition_columns (but without the =vals). – pmhargis May 22 '15 at 14:21