
I've been given the task of validating restores of Crate databases from one AWS region to another. The backups and restores are already happening, but we have no way to tell whether a restore succeeded without digging into each restore job individually. The approach we chose is to generate a CSV listing every table in the database along with its row count before the backup in the first region, copy that file to S3, generate the same file after the restore in the second region, and then compare the two files. However, Crate doesn't define "table_name" in INFORMATION_SCHEMA.tables, so this query:

SELECT CONCAT(table_schema, '.', table_name) FROM INFORMATION_SCHEMA.TABLES where table_schema IN ('doc', 'app');

won't work. I've figured out how to do this in bash as follows:

/apps/crate/crash -c "SELECT CONCAT(table_schema, '.', table_name) FROM INFORMATION_SCHEMA.TABLES where table_schema IN ('doc', 'app');" --format csv | grep -v ^\' | dos2unix > crate_tables

for x in $(cat crate_tables) ; do /apps/crate/crash -c "select count(1) from ${x};" --format csv | egrep -v "^$|^count"; done | dos2unix > crate_row_count

paste -d ',' crate_tables crate_row_count > crate_table_row_count.csv

but I've been told to do this in Python. I have the beginning of a script, but I don't know how to use the output of the first query as the input for the second query:

#!/usr/bin/env python3
from crate import client
connection = client.connect('10.30.76.251:4200')
cursor = connection.cursor()
cursor.execute("""
        SELECT CONCAT(table_schema, '.', table_name) FROM INFORMATION_SCHEMA.TABLES where table_schema IN ('doc', 'shn')""")
TableList = cursor.fetchall()
for TableName in TableList:
    print(TableName)
    cursor.execute("""
        SELECT COUNT(*) from (SELECT CONCAT(table_schema, '.', table_name) FROM INFORMATION_SCHEMA.TABLES where table_schema IN ('doc', 'shn'))""")
    RowCount = cursor.fetchone()
    print(RowCount)

# print(Table_List)
cursor.close()
connection.close()

Any assistance would be greatly appreciated.

ajhowey

1 Answer


Maybe an alternative (easier) approach is to use the sys.shards table:

SELECT 
   schema_name || '.' || table_name as fqn,
   sum(num_docs) as total_records
FROM sys."shards"
WHERE schema_name IN ('doc', 'shn') AND "primary"
GROUP BY fqn;

The same query can be driven from Python and written straight to a CSV:

#!/usr/bin/env python3
from crate import client
import csv

connection = client.connect('10.30.76.251:4200')
cursor = connection.cursor()

cursor.execute("""
    SELECT 
       schema_name || '.' || table_name as fqn,
       sum(num_docs) as total_records
    FROM sys."shards"
    WHERE schema_name IN ('doc', 'shn') AND "primary"
    GROUP BY fqn;""")

TableList = cursor.fetchall()

with open('path/to/csv_file', 'w', encoding='UTF8') as f:
    writer = csv.writer(f)
    for TableName in TableList:
        writer.writerow(TableName)
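If you still need per-table counts the way the original script attempts, the missing piece is that `fetchall()` returns one-element tuples, and each table name has to be spliced into its own COUNT query. A minimal sketch (`count_query` is a hypothetical helper; the connection portion assumes the same host and schemas as the question and is left commented out):

```python
def count_query(fqn):
    """Build a per-table COUNT query from a schema-qualified name like 'doc.users'."""
    schema, table = fqn.split(".", 1)
    # Quote the identifiers so mixed-case or reserved names don't break the query.
    return f'SELECT COUNT(*) FROM "{schema}"."{table}"'

# Hypothetical usage against a live cluster (requires crate-python and a reachable node):
# from crate import client
# connection = client.connect('10.30.76.251:4200')
# cursor = connection.cursor()
# cursor.execute("SELECT CONCAT(table_schema, '.', table_name) "
#                "FROM information_schema.tables "
#                "WHERE table_schema IN ('doc', 'shn')")
# for (fqn,) in cursor.fetchall():   # each row is a one-element tuple
#     cursor.execute(count_query(fqn))
#     print(fqn, cursor.fetchone()[0])
```

Note that table names can't be passed as bind parameters, which is why the query string is built directly.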

proddata
  • 216
  • 1
  • 7
  • 1
    If you just want to compare row count for primary shards (excluding replicas row count) add the where clause `primary = true` to the query – jayeff Nov 17 '21 at 07:10
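For the final step the question describes — comparing the pre-backup and post-restore files — a small standalone helper is enough. A sketch, assuming both files are two-column `fqn,count` CSVs as produced above (`diff_counts` is a hypothetical name):

```python
import csv

def diff_counts(before_path, after_path):
    """Compare two 'fqn,count' CSV files.

    Returns {fqn: (before_count, after_count)} for every table whose count
    differs, or that exists in only one file (the missing side is None).
    An empty dict means the restore matches the source.
    """
    def load(path):
        with open(path, newline="") as f:
            return {row[0]: int(row[1]) for row in csv.reader(f) if row}

    before, after = load(before_path), load(after_path)
    diffs = {}
    for fqn in sorted(set(before) | set(after)):
        b, a = before.get(fqn), after.get(fqn)
        if b != a:
            diffs[fqn] = (b, a)
    return diffs
```

An empty result validates the restore; anything else names the tables to investigate.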