I am reading the Beam documentation and some Stack Overflow questions/answers in order to understand how to write a Pub/Sub message to BigQuery. As of now, I have a working example that receives protobuf messages and decodes them. The code looks like this:
(p
| 'ReadData' >> apache_beam.io.ReadFromPubSub(topic=known_args.input_topic, with_attributes=True)
| 'ParsePubsubMessage' >> apache_beam.Map(parse_pubsubmessage)
)
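
For context, parse_pubsubmessage decodes the protobuf payload and bundles it with the message attributes. It looks roughly like this sketch, where OrderEvent is a stand-in name for my generated protobuf class and DecodedPubsubMessage is the class shown further down:

# OrderEvent is a stand-in for my generated protobuf message class.
from orders_pb2 import OrderEvent

def parse_pubsubmessage(msg):
    # With with_attributes=True, msg is a PubsubMessage exposing the raw
    # payload as msg.data and the attributes dict as msg.attributes.
    event = OrderEvent()
    event.ParseFromString(msg.data)  # decode the protobuf bytes
    return DecodedPubsubMessage(msg.attributes, event)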
Eventually, what I want to do is write the decoded Pub/Sub message to BigQuery. All attributes (and the decoded byte data) will have a one-to-one column mapping.
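
Concretely, I am picturing a table whose columns mirror those fields one to one. In the 'name:TYPE' string form that WriteToBigQuery accepts, the schema would be something like this (the BigQuery types are my guesses, e.g. triggered_at as TIMESTAMP):

# Assumed column types; adjust to the real attribute/event types.
SCHEMA = (
    'attribute_one:STRING,attribute_two:STRING,order_id:STRING,'
    'sku:STRING,triggered_at:TIMESTAMP,status:STRING'
)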
What is confusing me is what my parse_pubsubmessage should return. As of now, it returns a custom class that holds all the fields, i.e.:
class DecodedPubsubMessage:
    def __init__(self, attr, event):
        self.attribute_one = attr['attribute_one']
        self.attribute_two = attr['attribute_two']
        self.order_id = event.order.order_id
        self.sku = event.item.item_id
        self.triggered_at = event.timestamp
        self.status = event.order.status
Is this the correct approach for this Dataflow pipeline? My thinking was that I would use this returned value to write to BigQuery, but because of the more advanced Python features used in the examples, I couldn't work out how. Here is a reference example that I was looking at. From that example, I am not sure how I would do the lambda map on my returned object to write to BigQuery.
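
To make the question concrete, this is roughly what I was attempting: a lambda that flattens my object into the dict rows that WriteToBigQuery expects, using the SCHEMA string from above (the table spec my-project:my_dataset.my_table is a placeholder):

(p
 | 'ReadData' >> apache_beam.io.ReadFromPubSub(topic=known_args.input_topic, with_attributes=True)
 | 'ParsePubsubMessage' >> apache_beam.Map(parse_pubsubmessage)
 # Flatten the custom object into a dict keyed by column name.
 | 'ToTableRow' >> apache_beam.Map(lambda m: {
       'attribute_one': m.attribute_one,
       'attribute_two': m.attribute_two,
       'order_id': m.order_id,
       'sku': m.sku,
       'triggered_at': m.triggered_at,
       'status': m.status})
 | 'WriteToBigQuery' >> apache_beam.io.WriteToBigQuery(
       'my-project:my_dataset.my_table',  # placeholder table spec
       schema=SCHEMA,
       write_disposition=apache_beam.io.BigQueryDisposition.WRITE_APPEND,
       create_disposition=apache_beam.io.BigQueryDisposition.CREATE_IF_NEEDED)
)

Is that the right shape, or would it be more idiomatic to have parse_pubsubmessage return a plain dict directly so the extra Map isn't needed?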