Get row count from all tables in hive

Question

How can I get row count from all tables using hive? I am interested in the database name, table name and row count

Check this [link](https://stackoverflow.com/a/45304801/7043815) hope it helps!! — satish silveri, Jul 25 '17 at 13:32

score 28 · Answer 1 · answered Feb 27 '14 at 17:28

28

You will need to do a

select count(*) from table

for all tables.

To automate this, you can make a small bash script and some bash commands. First run

$hive -e 'show tables' | tee tables.txt

This stores all tables in the database in a text file tables.txt

Create a bash file (count_tables.sh) with the following contents.

while read line
do
 echo "$line "
 eval "hive -e 'select count(*) from $line'"
done

Now run the following commands.

$chmod +x count_tables.sh
$./count_tables.sh < tables.txt > counts.txt

This creates a text file(counts.txt) with the counts of all the tables in the database

answered Feb 27 '14 at 17:28

Mukul Gupta

2,310
3
24
39

when I run this after an `INSERT INTO TABLE table SELECT row FROM another_table` I get only the number of rows added and not the total number of rows in table, do you how I can always get the total number of rows? – MoustafaAAtta Jun 11 '14 at 08:09
@Mukul the output is +-----Table-----+.... with lines starting and ending with | . Any way to grab just the table name? – AM_Hawk Feb 16 '17 at 21:48

score 4 · Answer 2 · edited Oct 23 '19 at 18:17

4

A much faster way to get approximate count of all rows in a table is to run explain on the table. In one of the explain clauses, it shows row counts like below:

TableScan [TS_0] (rows=224910 width=78)

The benefit is that you are not actually spending cluster resources to get that information.

The HQL command is explain select * from table_name; but when not optimized not shows rows in the TableScan.

edited Oct 23 '19 at 18:17

Peter Krauss

13,174
24
167
304

answered Apr 25 '17 at 20:02

Pratik Khadloya

12,509
11
81
106

2

what is "run explain on the table" is that `explain table db.table;`? Because that's not valid. – dlamblin Jul 06 '18 at 08:53

score 1 · Answer 3 · answered Feb 20 '14 at 10:44

1

select count(*) from table

I think there is no more efficient way.

answered Feb 20 '14 at 10:44

pensz

1,871
1
13
18

How can this be automated for all tables in a database. I am actually interested in a feature similar to information_schema.tables like feature in hive which would enlist record count in all tables in a database using HQL. Any thoughts – Raunak Jhawar Feb 20 '14 at 10:47
Any such feature in derby. What is the database name of the metastore by default? – Raunak Jhawar Feb 20 '14 at 11:13
Derby or MySQL just store meta info of table, that is the schema. In hive, schema of table just define a way to process data in hdfs. When you load data in hive, hive just put file in hdfs and update hdfs location in metastore. There is no counter in metastore. – pensz Feb 21 '14 at 02:17
This helps. Thank you. What can be done when such an audit feature is required to understand how many rows were loaded in all tables across databases. I would not like to issue several count * and would like to keep that as the last option. – Raunak Jhawar Feb 21 '14 at 08:35
when I run this after an `INSERT INTO TABLE table SELECT row FROM another_table` I get only the number of rows added and not the total number of rows in table, do you how I can always get the total number of rows? – MoustafaAAtta Jun 11 '14 at 08:08

marknorkin · Answer 4 · 2020-03-19T11:17:27.413

You can collect the statistics on the table by using Hive ANALAYZE command. Hive cost based optimizer makes use of these statistics to create optimal execution plan.

Below is the example of computing statistics on Hive tables:

hive> ANALYZE TABLE stud COMPUTE STATISTICS;
 Query ID = impadmin_20171115185549_a73662c3-5332-42c9-bb42-d8ccf21b7221
 Total jobs = 1
 Launching Job 1 out of 1
 …
 Table training_db.stud stats: [numFiles=5, numRows=5, totalSize=50, rawDataSize=45]
 OK
 Time taken: 8.202 seconds

Links: http://dwgeek.com/apache-hive-explain-command-example.html/

score 0 · Answer 5 · edited Feb 02 '16 at 17:33

0

You can also set the database in the same command and separate with ;.

hive -e 'use myDatabase;show tables'

edited Feb 02 '16 at 17:33

Benjamin W.

46,058
19
106
116

answered Feb 02 '16 at 17:22

Dranyar

580
1
10
16

score 0 · Answer 6 · answered May 19 '18 at 13:42

try this guys to automate-- put in shell after that run bash filename.sh

hive -e 'select count(distinct fieldid) from table1 where extracttimestamp<'2018-04-26'' > sample.out

hive -e 'select count(distinct fieldid) from table2 where day='26'' > sample.out

lc=cat sample.out | uniq | wc -l if [ $lc -eq 1 ]; then echo "PASS" else echo "FAIL" fi

score 0 · Answer 7 · edited Aug 06 '19 at 06:28

0

How do I mention the specific database that it needs to refer in below snippet:

while read line
do
 echo "$line "
 eval "hive -e 'select count(*) from $line'"
done

edited Aug 06 '19 at 06:28

Yunnosch

26,130
9
42
54

answered Aug 06 '19 at 06:26

Menta Chakravarthi

1

1

By the way, I assume that this is an answer starting off with a rethoric question. If you are asking this, please turn it instead into a separate question. – Yunnosch Aug 06 '19 at 06:28

score 0 · Answer 8 · answered Sep 03 '19 at 16:10

Here's a solution I wrote that uses python:

import os
dictTabCnt={}

print("=====Finding Tables=====")
tableList = os.popen("hive --outputformat=dsv --showHeader=false -e \"use [YOUR DB HERE]; show tables;\"").read().split('\n')

print("=====Finding Table Counts=====")
for i in tableList:
    if i <> '':
        strTemp = os.popen("hive --outputformat=dsv --showHeader=false -e \"use [YOUR DB HERE]; SELECT COUNT(*) FROM {}\"".format(i)).read()
        dictTabCnt[i] = strTemp

print("=====Table Counts=====")
for table,cnt in dictTabCnt.items():
    print("{}: {}".format(table,cnt))

depends on `select count(*) from table` – Peter Krauss Oct 23 '19 at 18:18 — Peter Krauss, Oct 23 '19 at 18:18

score 0 · Answer 9 · answered Sep 22 '22 at 05:52

Thanks to @mukul_gupta for providing shell script.

how ever we are encounting below error for the same

"bash syntax error near unexpected token done"

Solution for this at below link

BASH Syntax error near unexpected token 'done'

Also if any one need how to select DB Name

$hive -e 'use databasename;show tables' | tee tables.txt

for passing db name in select statement, give DB name in tableslist file itself.

Get row count from all tables in hive

9 Answers9

Linked