2

I want to scan a big-table for a list of IDs (or prefixes of IDs) (using Python HappyBase).

Is there any way to do it on server side? That is, I'd like to send a list of start/stop rows to be scanned in one API call rather than performing a long series of API calls.

Here's an example. For my_big_tables keys:

2019/1
2019/2
2019/3
...
2020/1
2020/2
2020/3
2020/4
..

In one query, I'd like to get all the records from months 1 and 2 for all years. The results should be:

2019/1
2019/2
2020/1
2020/2
linqo
  • 617
  • 4
  • 16
RELW
  • 189
  • 1
  • 14
  • Thank you for awarding the bounty! Could you please mark the answer as solved, if it worked for you? – linqo May 09 '20 at 15:35

1 Answers1

2

Rather than using the row_start and row_stop arguments in Table.scan(), this may be a better fit for the filter argument with a regular expression.

See the API reference for details on the filter argument:

The keyword argument filter is also supported (beyond column and row range filters supported here). HappyBase / HBase users will have used this as an HBase filter string. (See the Thrift docs for more details on those filters.) However, Google Cloud Bigtable doesn’t support those filter strings so a RowFilter should be used instead.

RowFilter is a type provided by Google's Bigtable library. Here are the docs. Assuming that the ID field you're referring to is your row key, we can use RowKeyRegexFilter to filter the IDs by the pattern you've described.

We'll start by coming up with a regular expression to match a list of IDs for the desired months. For example, if you wanted to filter year-based IDs for the months of December and January, you could use this (note that you must go from the largest number to the shortest) -- see this link to test the regular expression:

\d\d\d\d\/(12|1)

Here's an attempt to write a function that creates a Google Bigtable HappyBase scan call with an appropriate filter, where table is a HappyBase table and months is a list of integers. Please note that I have not tested this code, but hopefully it at least gives you a starting point.

from google.cloud.bigtable.row_filters import RowKeyRegexFilter

def filter_by_months(table, months):
    months_reversed = sorted(months, reverse=True)
    months_strings = [str(month) for month in months_reversed]
    months_joined = "|".join(months_strings)

    key_filter = RowKeyRegexFilter('\d\d\d\d\/({})'.format(months_joined))
    return table.scan(filter=key_filter)
linqo
  • 617
  • 4
  • 16