Preventing duplicates with MapReduce to BigQuery pipeline

Question

I was reading Michael's answer to the post linked below, which suggests using a pipeline to move data from the Datastore to Cloud Storage to BigQuery.

Google App Engine: Using Big Query on datastore?

I want to use this technique to append data to a BigQuery table. That means I need some way of knowing whether an entity has already been processed, so it doesn't get submitted to BigQuery again on later MapReduce runs. I don't want to rebuild my table each time.

The way I see it, I have two options. I can put a flag on each entity, update it when the entity is processed, and filter flagged entities out on subsequent runs. Or I can save each entity to a new table and delete it from the source table. The second way seems superior, but I wanted to ask for other options or see if there are any gotchas.

score 0 · Answer 1 · answered Jun 13 '12 at 03:55

Assuming you have some stream of activity represented as entities, you can use query cursors to start one query where a prior one left off. Query cursors are a good fit for the kind of incremental situation you've described, because they avoid the overhead of marking entities as having been processed.

I'd have to poke around a bit to see if App Engine MapReduce supports cursors (I suspect that it doesn't, yet).
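As a rough illustration of the cursor idea, here is a minimal sketch in plain Python. It does not use the App Engine or MapReduce APIs: `ENTITIES`, `fetch_page`, `export_new_rows`, `state`, and `sink` are all hypothetical stand-ins, and the cursor is modeled as a simple offset, whereas real Datastore cursors are opaque strings. It also assumes new entities always sort after older ones (for example, a query ordered by creation time).

```python
# Hypothetical in-memory stand-in for an ordered stream of Datastore entities.
ENTITIES = [{"id": i, "event": "click"} for i in range(1, 8)]


def fetch_page(cursor, limit):
    """Stand-in for a cursor-aware query: returns (rows, next_cursor).

    The cursor here is just an offset; a real cursor would be opaque.
    """
    start = cursor or 0
    rows = ENTITIES[start:start + limit]
    return rows, start + len(rows)


def export_new_rows(state, sink, page_size=3):
    """Export everything after the saved cursor, then save the new cursor."""
    cursor = state.get("cursor")
    batch = []
    while True:
        rows, cursor = fetch_page(cursor, page_size)
        if not rows:
            break
        batch.extend(rows)
    # Finalize the output and persist the cursor in the same step, so the
    # next run resumes exactly where this one stopped.
    sink.extend(batch)
    state["cursor"] = cursor
    return len(batch)


state, sink = {}, []
first = export_new_rows(state, sink)         # exports the initial entities
ENTITIES.append({"id": 8, "event": "view"})  # new activity arrives
second = export_new_rows(state, sink)        # exports only the new entity
```

In a real pipeline, `sink` would be the file handed to the BigQuery load job and `state` would be a small persisted record; both are assumptions made here for illustration.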

Dave W. Smith

I've used query cursors, even with fan-out, and I examined them for my use case. The problem is being able to finalize the blob that I'm writing to in a fan-out situation. Since fan-out spawns child jobs, it's hard to determine in my case when everything has completed, and I don't want to get into writing cron jobs to check for blobs. That's why I'm trying to coordinate things cleanly using pipelines. For my use case, accuracy and traceability are the most important. – John Wheeler Jun 13 '12 at 04:11
I'm probably missing something key about your situation. The usual thing to do is to finalize the blob at end-of-query, at the same time that you get (and save away) the query cursor so that you can pick up where you left off later. – Dave W. Smith Jun 13 '12 at 04:32
Can someone else please answer this? The answer above is not acceptable. – John Wheeler Jun 17 '12 at 14:23
Your first option sounds good to me: filter your entities on some kind of flag using the MapReduce API. This option does not work yet because of a bug, but it should someday. See http://stackoverflow.com/a/11851856/1387380. – Charles Aug 31 '12 at 11:58