
I've been given the task of validating restores of Crate databases from one AWS region to another. The backups and restores are already happening, but we have no way to tell whether a restore succeeded without digging into each restore job individually. The approach we chose is to generate a CSV listing every table in the database along with its row count before the backup in the first region, copy that file to S3, generate the same file after the restore in the second region, and then compare the two files. However, Crate doesn't define "table_name" in INFORMATION_SCHEMA.tables, so this query:

SELECT CONCAT(table_schema, '.', table_name) FROM INFORMATION_SCHEMA.TABLES where table_schema IN ('doc', 'app');

won't work. I've figured out how to do this in bash as follows:

/apps/crate/crash -c "SELECT CONCAT(table_schema, '.', table_name) FROM INFORMATION_SCHEMA.TABLES where table_schema IN ('doc', 'app');" --format csv | grep -v ^\' | dos2unix > crate_tables

for x in $(cat crate_tables) ; do /apps/crate/crash -c "select count(1) from ${x};" --format csv | egrep -v "^$|^count"; done | dos2unix > crate_row_count

paste -d ',' crate_tables crate_row_count > crate_table_row_count.csv

but I've been told to do this in Python. I have the beginning of a script, but I don't know how to use the output of the first query as the input for the second query:

#!/usr/bin/env python3
from crate import client
connection = client.connect('10.30.76.251:4200')
cursor = connection.cursor()
cursor.execute("""
        SELECT CONCAT(table_schema, '.', table_name) FROM INFORMATION_SCHEMA.TABLES where table_schema IN ('doc', 'shn')""")
TableList = cursor.fetchall()
for TableName in TableList:
    print(TableName)
    cursor.execute("""
        SELECT COUNT(*) from (SELECT CONCAT(table_schema, '.', table_name) FROM INFORMATION_SCHEMA.TABLES where table_schema IN ('doc', 'shn'))""")
    RowCount = cursor.fetchone()
    print(RowCount)

# print(Table_List)
cursor.close()
connection.close()

Any assistance would be greatly appreciated.

ajhowey

1 Answer


Maybe an alternative (easier) approach is to use the sys.shards table:

SELECT 
   schema_name || '.' || table_name as fqn,
   sum(num_docs) as total_records
FROM sys."shards"
WHERE schema_name IN ('doc', 'shn') AND "primary"
GROUP BY fqn;

The same query can be driven from Python and written straight to a CSV:

#!/usr/bin/env python3
from crate import client
import csv

connection = client.connect('10.30.76.251:4200')
cursor = connection.cursor()

cursor.execute("""
    SELECT 
       schema_name || '.' || table_name as fqn,
       sum(num_docs) as total_records
    FROM sys."shards"
    WHERE schema_name IN ('doc', 'shn') AND "primary"
    GROUP BY fqn;""")

TableList = cursor.fetchall()

with open('path/to/csv_file', 'w', encoding='UTF8') as f:
    writer = csv.writer(f)
    for TableName in TableList:
        writer.writerow(TableName)
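If you still need per-table counts the way the original script attempts, the missing piece is that `fetchall()` returns one-element tuples, and each table name has to be spliced into its own COUNT query. A minimal sketch (`count_query` is a hypothetical helper; the connection portion assumes the same host and schemas as the question and is left commented out):

```python
def count_query(fqn):
    """Build a per-table COUNT query from a schema-qualified name like 'doc.users'."""
    schema, table = fqn.split(".", 1)
    # Quote the identifiers so mixed-case or reserved names don't break the query.
    return f'SELECT COUNT(*) FROM "{schema}"."{table}"'

# Hypothetical usage against a live cluster (requires crate-python and a reachable node):
# from crate import client
# connection = client.connect('10.30.76.251:4200')
# cursor = connection.cursor()
# cursor.execute("SELECT CONCAT(table_schema, '.', table_name) "
#                "FROM information_schema.tables "
#                "WHERE table_schema IN ('doc', 'shn')")
# for (fqn,) in cursor.fetchall():   # each row is a one-element tuple
#     cursor.execute(count_query(fqn))
#     print(fqn, cursor.fetchone()[0])
```

Note that table names can't be passed as bind parameters, which is why the query string is built directly.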

proddata
  • 216
  • 1
  • 7
  • 1
    If you just want to compare row count for primary shards (excluding replicas row count) add the where clause `primary = true` to the query – jayeff Nov 17 '21 at 07:10
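For the final step the question describes — comparing the pre-backup and post-restore files — a small standalone helper is enough. A sketch, assuming both files are two-column `fqn,count` CSVs as produced above (`diff_counts` is a hypothetical name):

```python
import csv

def diff_counts(before_path, after_path):
    """Compare two 'fqn,count' CSV files.

    Returns {fqn: (before_count, after_count)} for every table whose count
    differs, or that exists in only one file (the missing side is None).
    An empty dict means the restore matches the source.
    """
    def load(path):
        with open(path, newline="") as f:
            return {row[0]: int(row[1]) for row in csv.reader(f) if row}

    before, after = load(before_path), load(after_path)
    diffs = {}
    for fqn in sorted(set(before) | set(after)):
        b, a = before.get(fqn), after.get(fqn)
        if b != a:
            diffs[fqn] = (b, a)
    return diffs
```

An empty result validates the restore; anything else names the tables to investigate.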