I have been testing Apache Beam using the 2.13.0 SDK on Python 2.7.16, pulling simple messages from a Google Pub/Sub subscription in streaming mode and writing them to a Google BigQuery table. As part of this, I'm trying to use the Pub/Sub message ID for deduplication; however, I can't seem to get it out at all.

The documentation for the ReadFromPubSub method and the PubsubMessage type suggests that service-generated values, such as the field named by id_label, should be returned as part of the attributes property; however, they do not appear to be returned.

Note that the id_label parameter is only supported when using the Dataflow runner.
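
For clarity, this is the access pattern the documentation led me to expect (a minimal sketch of my own; the full pipeline follows below):

def extract_id(message):
    # Expected per the docs: with with_attributes=True, each element is a
    # PubsubMessage, and the field named by id_label should surface in its
    # attributes dict. In practice it never appears there.
    return message.attributes.get('message_id')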

Code to send a message

import json
from datetime import datetime

from google.cloud import pubsub_v1

project_id = "[YOUR PROJECT]"
topic_name = "test-apache-beam"

publisher = pubsub_v1.PublisherClient()
topic_path = publisher.topic_path(project_id, topic_name)

def callback(message_future):
    # exception() blocks for up to 30 seconds waiting for the publish result
    if message_future.exception(timeout=30):
        print('Publishing to {} threw an Exception: {}.'.format(topic_name, message_future.exception()))
    else:
        # result() is the server-assigned message ID
        print(message_future.result())

for n in range(1, 11):
    data = {'rownumber': n}
    jsondata = json.dumps(data)  # a str on Python 2, which publish() accepts as the bytestring payload
    message_future = publisher.publish(topic_path, data=jsondata, source='python',
                                       timestamp=datetime.now().strftime("%Y-%b-%d (%H:%M:%S:%f)"))
    message_future.add_done_callback(callback)

print('Published message IDs:')
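
Incidentally, the future returned by publish() resolves to the server-assigned message ID, so the IDs can at least be confirmed on the publisher side (a small sketch of my own, not part of the original script):

# The publish() future resolves to the server-assigned message ID, so the
# publisher can log the IDs that the Beam pipeline later fails to surface.
future = publisher.publish(topic_path, data=json.dumps({'rownumber': 1}))
print('published message ID: {}'.format(future.result()))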

The Beam pipeline code:-

from __future__ import absolute_import

import argparse
import logging
import json

import apache_beam as beam
from apache_beam.io import ReadFromPubSub
from apache_beam.options.pipeline_options import PipelineOptions
from apache_beam.options.pipeline_options import SetupOptions
from apache_beam.options.pipeline_options import StandardOptions

def format_message_element(message, timestamp=beam.DoFn.TimestampParam):
    # message is a PubsubMessage (since with_attributes=True); flatten its
    # payload and attributes into a dict matching the BigQuery schema below
    data = json.loads(message.data)
    attribs = message.attributes

    fullmessage = {'data': data,
                   'attributes': attribs,
                   'attribstring': str(message.attributes)}

    return fullmessage

def run(argv=None):

    parser = argparse.ArgumentParser()
    input_group = parser.add_mutually_exclusive_group(required=True)
    input_group.add_argument(
                        '--input_subscription',
                        dest='input_subscription',
                        help=('Input PubSub subscription of the form '
                        '"projects/<PROJECT>/subscriptions/<SUBSCRIPTION>."'))
    input_group.add_argument(
                        '--test_input',
                        action="store_true",
                        default=False
    )
    group = parser.add_mutually_exclusive_group(required=True) 
    group.add_argument(
      '--output_table',
      dest='output_table',
      help=
      ('Output BigQuery table for results specified as: PROJECT:DATASET.TABLE '
       'or DATASET.TABLE.'))
    group.add_argument(
        '--output_file',
        dest='output_file',
        help='Output file to write results to.')
    known_args, pipeline_args = parser.parse_known_args(argv)

    options = PipelineOptions(pipeline_args)
    options.view_as(SetupOptions).save_main_session = True

    if known_args.input_subscription:
        options.view_as(StandardOptions).streaming=True

    with beam.Pipeline(options=options) as p:

        from apache_beam.io.gcp.internal.clients import bigquery

        table_schema = bigquery.TableSchema()

        attribfield = bigquery.TableFieldSchema()
        attribfield.name = 'attributes'
        attribfield.type = 'record'
        attribfield.mode = 'nullable'

        attribsource = bigquery.TableFieldSchema()
        attribsource.name = 'source'
        attribsource.type = 'string'
        attribsource.mode = 'nullable'

        attribtimestamp = bigquery.TableFieldSchema()
        attribtimestamp.name = 'timestamp'
        attribtimestamp.type = 'string'
        attribtimestamp.mode = 'nullable'

        attribfield.fields.append(attribsource)
        attribfield.fields.append(attribtimestamp)
        table_schema.fields.append(attribfield)

        datafield = bigquery.TableFieldSchema()
        datafield.name = 'data'
        datafield.type = 'record'
        datafield.mode = 'nullable'

        datanumberfield = bigquery.TableFieldSchema()
        datanumberfield.name = 'rownumber'
        datanumberfield.type = 'integer'
        datanumberfield.mode = 'nullable'
        datafield.fields.append(datanumberfield)
        table_schema.fields.append(datafield)

        attribstringfield = bigquery.TableFieldSchema()
        attribstringfield.name = 'attribstring'
        attribstringfield.type = 'string'
        attribstringfield.mode = 'nullable'
        table_schema.fields.append(attribstringfield)

        if known_args.input_subscription:
            messages = (p
            | 'Read From Pub Sub' >> ReadFromPubSub(subscription=known_args.input_subscription,
                                                    with_attributes=True,
                                                    id_label='message_id')
            | 'Format Message' >> beam.Map(format_message_element)
            )

            output = (messages | 'write' >> beam.io.WriteToBigQuery(
                        known_args.output_table,
                        schema=table_schema,
                        create_disposition=beam.io.BigQueryDisposition.CREATE_IF_NEEDED,
                        write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND)
                    )

    # the with-block runs the pipeline and waits for completion on exit,
    # so a separate p.run()/wait_until_finish() would start a second run

if __name__ == '__main__':
    logging.getLogger().setLevel(logging.INFO)
    run()

And the command to run the Python script:-

python PythonTestMessageId.py --runner DataflowRunner --project [YOURPROJECT] --input_subscription projects/[YOURPROJECT]/subscriptions/test-apache-beam.subscription --output_table [YOURPROJECT]:test.newtest --temp_location gs://[YOURPROJECT]/tmp --job_name test-job

In the code provided, I'm simply converting the dictionary from the attributes property to a string and inserting it into a BigQuery table. The data returned in the table looks like this:-

[Screenshot: BigQuery table output]

As you can see, the two properties within the attributes field are simply the ones I passed in, and the Pub/Sub message ID is not available.

Is there a way this can be returned?

Matthew Darwin
  • Can you try accessing the message_id inside your `format_message_element` function: `message_id = message.message_id`? – ostrokach Jul 24 '19 at 18:37
  • I've added ```messageid = message.message_id``` and added ```'message_id' : messageid``` to the dictionary, along with adding it to the BigQuery table; this returns the following error:- ```'PubsubMessage' object has no attribute 'message_id'``` – Matthew Darwin Jul 25 '19 at 09:01

2 Answers


This is a known issue. A bug report has been filed in JIRA for exposing message_id in PubsubMessage. Please vote up this bug report.
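
In the meantime, a possible workaround (an editor's suggestion, not part of this answer) is to stamp each message with a client-generated unique attribute at publish time and deduplicate on that instead of the service-assigned ID; `dedup_id` below is a made-up attribute name:

import uuid

# Hypothetical workaround: attach a client-generated UUID as an attribute,
# then deduplicate on attributes['dedup_id'] rather than the (currently
# inaccessible) service-assigned message ID.
message_future = publisher.publish(topic_path, data=jsondata, dedup_id=str(uuid.uuid4()))

On the Beam side, passing id_label='dedup_id' to ReadFromPubSub should then let the Dataflow runner use that attribute for deduplication.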

Jason Ganetsky

Looks like this may not be working as intended, and a JIRA issue has been logged: https://issues.apache.org/jira/plugins/servlet/mobile#issue/BEAM-7819

Matthew Darwin