We are facing performance issues when querying BigTable.
The monitoring dashboard reports about 500 rows/sec when we query roughly 1,400 rows, and the following snippet takes about 1.6 seconds on a machine in the same region as the BigTable instance:
import datetime
import struct

# `instance`, `table_name`, `row_filter`, `row_set` and `column_family`
# are defined elsewhere.
partial_rows = instance.table(table_name).read_rows(filter_=row_filter, row_set=row_set)

ids = {}
for row in partial_rows:
    # Row key layout: {category}{id}_{year}{month}
    key = row.row_key.decode()
    category = key[0:5]
    id_, year_month = key[5:].split('_')
    dates = ids.get(id_, {})
    for k, v in row.cells[column_family].items():
        # Column qualifier layout: {day}{measurement_name}
        sub_key = k.decode()
        day = sub_key[0:2]
        measurement = f'{category}_{sub_key[3:]}'
        date = datetime.date(int(year_month[0:4]), int(year_month[4:6]), int(day))
        # Cell values are stored as big-endian 32-bit integers.
        value = struct.unpack('>i', v[0].value)[0]
        measurements = dates.get(date, {})
        measurements[measurement] = value
        dates[date] = measurements
    ids[id_] = dates
The table schema we are using is:
- Row key: {category}{id}_{year}{month}
- Column qualifier: {day}{measurement_name}

This schema follows BigTable's schema design guidelines.
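For reference, this is roughly what a single row key and column qualifier look like under that schema. It is only an illustrative sketch: the 5-character category, the YYYYMM suffix, and the 2-character day prefix are inferred from the parsing code above, and all literal values are made up.

# Hypothetical example of the key layout the parsing code above assumes.
category = "elctr"                                # fixed width, 5 chars (key[0:5])
id_ = "4326"
year_month = "201901"                             # YYYYMM
row_key = f"{category}{id_}_{year_month}"         # "elctr4326_201901"

day = "01"                                        # 2 chars (sub_key[0:2])
measurement_name = "value_a"
column_qualifier = f"{day}_{measurement_name}"    # "01_value_a" (sub_key[3:] -> "value_a")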
The snippet itself is simple: we parse the row key and the column qualifiers to build an ids dictionary that looks like this:
{
    "4326": {
        "2019-01-01": {
            "value_a": 49
        },
        "2019-01-02": {
            "value_a": 53
        },
        ...
    },
    "3857": {
        "2019-01-01": {
            "value_a": 56
        },
        "2019-01-02": {
            "value_a": 59
        },
        ...
    },
    ...
}
Do you have any idea why we are getting such low read throughput?
Is there a way to fetch all the data at once instead of consuming the read_rows streamed generator? Or are there other options, either with the same Python client or a different one, such as an asynchronous approach?
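To clarify what we mean by "all at once", we had something like the following in mind. This is just a sketch; list() still consumes the same stream under the hood, so we have not verified it would change anything.

# Drain the whole stream up front, then run the same parsing loop over `rows`.
# This still iterates the read_rows generator internally, so it may not
# improve throughput at all.
rows = list(
    instance.table(table_name).read_rows(filter_=row_filter, row_set=row_set)
)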