
I have a problem for which I haven't found an adequate solution. I have a row key such as {project}#{location}#raw#{timestamp}

I would like to find the row with the latest timestamp for a given prefix. Example: with project and location specified, I want the row with the latest timestamp matching Project1#Location1#raw#{??}

Is there any way to do that?

I guess the naive way would be to query over a long time range and then sort the results in Python to find the latest timestamp, but that feels rather wasteful.

limjix

2 Answers


Since the timestamp is embedded in the row key itself, you will have to use a regex like the one you mentioned in the question (Project1#Location1#raw#{??}) to filter the records. As for sorting, as you can see in this documentation:

When Cloud Bigtable stores rows, it sorts them by row key in lexicographic order

So you don't have to sort at all: just take the last position in the query results and it will be the record you want.
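One caveat worth keeping in mind (an assumption about the key design, not stated in the question): lexicographic order only matches chronological order if the timestamps are zero-padded to a fixed width. A quick sketch:

```python
# Unpadded timestamps break lexicographic ordering: "10" < "9" as strings.
keys = ["Project1#Location1#raw#9", "Project1#Location1#raw#10"]
print(sorted(keys)[-1])   # Project1#Location1#raw#9  (the wrong row!)

# Zero-padding the timestamp restores the expected order.
padded = ["Project1#Location1#raw#09", "Project1#Location1#raw#10"]
print(sorted(padded)[-1])  # Project1#Location1#raw#10
```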

You mentioned you are using Python; in that case, you can check this example in the documentation for row key regex on how to get the data you want. After that, all you have to do is print the last position of the rows in that example. In order to do that, as discussed in the comments, you can use the following code:

# Materialize the streamed results, then take the last row, whose key
# is lexicographically greatest, i.e. has the latest timestamp.
rows.consume_all()
data = rows.rows          # dict of row_key -> PartialRowData, in key order
print(data)
print(list(data)[-1])     # row key of the latest entry

Also, as discussed in the comments, if performance is an issue for you, consider using a row prefix instead of a filter on your search, as described here. The documentation says that Reads that use filters are slower than reads without filters and that Restrict the rowset as much as possible is the first step in improving performance, so this may be a better approach than the one I suggested above.
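A sketch of the prefix-based read (the project/instance/table names are placeholders): the end key is computed by incrementing the last byte of the prefix, which is how the tu7#s777#raw# / tu7#s777#raw$ pair in the comments was derived.

```python
def prefix_range(prefix: str):
    """Return (start_key, end_key) bytes covering every row key with this prefix."""
    start = prefix.encode("utf-8")
    # Increment the last byte: keys >= start and < end all share the prefix.
    end = start[:-1] + bytes([start[-1] + 1])
    return start, end


def latest_row_key(table, prefix: str):
    """Return the last (latest) row key under prefix.

    `table` is a google.cloud.bigtable Table; read_rows streams rows back
    already sorted by key, so the last one has the latest timestamp.
    """
    start_key, end_key = prefix_range(prefix)
    rows = table.read_rows(start_key=start_key, end_key=end_key)
    rows.consume_all()
    return list(rows.rows)[-1]


# '#' + 1 == '$', so the range for "Project1#Location1#raw#" ends at "...raw$".
print(prefix_range("Project1#Location1#raw#"))
```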

Ralemos
  • Thanks Rafael, how do I get the last position of the results of the query using Regex though? Do you mind showing me? – limjix Jan 20 '21 at 01:08
  • I have edited the answer with a link to an example in the documentation that does pretty much what you need. – Ralemos Jan 20 '21 at 11:19
  • Sorry, this might be a really stupid question, please forgive me. I am aware of the row key regex, so say I narrow the regex to Project1#Location1#raw#{smtg}, bigtable will need to scan many rows in the table relevant to that regex... and return lots of rows to Python to process in a for loop, and it will take time to reach the end of the loop; surely that's not efficient? Or am I not thinking about this correctly? In a way, am I not still sorting the results on the client side? – limjix Jan 21 '21 at 02:24
  • well not really "sorting", because it's already sorted lexicographically. But what I mean is that Bigtable still needs to return a whole bunch of data and I go through the for loop to reach the last entry – limjix Jan 21 '21 at 05:47
  • I suppose a more distilled version of my question is: In the for loop, how do I go directly to the last row in the "rows" object to get the latest entry – limjix Jan 21 '21 at 05:51
  • Also, the filter method above would be inefficient for a large set of data; even 50k rows takes 4 minutes, because Bigtable is scanning the entire table using this row key range: start_key = f"tu7#s777#raw#", end_key = f"tu7#s777#raw$" – limjix Jan 21 '21 at 06:22
  • ok, many additional questions hahaha 1) to get the last row you just need to do `lastKey = rows[-1]` - simple as that, as you can see [here](https://stackoverflow.com/questions/930397/getting-the-last-element-of-a-list), so you don't need a for loop at all, which takes 50% of your workload off. – Ralemos Jan 21 '21 at 14:33
  • 2) If performance is an issue for you, consider using row prefixes instead of a filter on your search as described [here](https://cloud.google.com/bigtable/docs/reading-data#prefix); the [documentation](https://cloud.google.com/bigtable/docs/reads#performance) says that `Reads that use filters are slower than reads without filters` and `Restrict the rowset as much as possible` is the first step in improving performance, so this might be a better approach than the one I suggested. Let me know if you want me to include these 2 considerations in the answer. – Ralemos Jan 21 '21 at 14:36
  • Haha, the reason I'm asking is because I've tried rows[-1] and it says PartialRowData is not subscriptable, which makes sense because it's a generator that is consuming data row by row through the gRPC call. – limjix Jan 22 '21 at 00:39
  • Found a solution, which is to use consume_all() first: `rows.consume_all(); data = rows.rows; print(data); print(list(data)[-1])` – limjix Jan 22 '21 at 01:23
  • So, this means you get your expected results at the end? – Ralemos Jan 25 '21 at 17:12
  • Yeah, but it requires consume_all() first in order to get all the rows – limjix Jan 26 '21 at 02:04
  • Ok, I will add all that information to the answer then, please consider accepting/upvoting it if it solved the issue you faced. – Ralemos Jan 26 '21 at 16:53

As an alternate approach, consider creating a side table to index the timestamps, with a row key like {project}#{location}#{timestamp}. This will allow you to easily find the latest timestamp per project and location, at the cost of having to maintain 2 tables (2 writes per record, additional storage, etc.).
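One way to make the side-table lookup a single-row read (a sketch going beyond the answer; the fixed width and maximum timestamp are assumptions): store a reversed timestamp in the index key, so the newest entry sorts lexicographically first and a prefix scan of {project}#{location}# can stop after one row.

```python
MAX_TS_MILLIS = 10**13  # assumed upper bound on epoch-millis timestamps


def index_key(project: str, location: str, ts_millis: int) -> str:
    """Index row key in which newer timestamps sort lexicographically first."""
    reversed_ts = MAX_TS_MILLIS - ts_millis
    # Zero-pad to a fixed width so string order matches numeric order.
    return f"{project}#{location}#{reversed_ts:013d}"


# A newer event produces a smaller key, so the latest entry is the
# first row returned by a prefix scan of the index table.
newer = index_key("Project1", "Location1", 1_700_000_000_000)
older = index_key("Project1", "Location1", 1_600_000_000_000)
assert newer < older
```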

FuzzyAmi