I have a Dataflow pipeline written in Python; this is what it does:
Read messages from Pub/Sub. Messages are zipped protocol buffers. One message received on Pub/Sub contains multiple types of messages. See the parent protocol message specification below:
message BatchEntryPoint {
  /**
   * EntryPoint
   *
   * Description: Encapsulation message
   */
  message EntryPoint {
    // Proto Message
    google.protobuf.Any proto = 1;

    // Timestamp
    google.protobuf.Timestamp timestamp = 4;
  }

  // Array of EntryPoint messages
  repeated EntryPoint entrypoints = 1;
}
To explain a bit better: I have several protobuf message types. Each message must be packed into the proto field of an EntryPoint message. We send several messages at once because of MQTT limitations, which is why BatchEntryPoint has a repeated field of EntryPoint messages.
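For illustration, the sender side builds a batch roughly like this (a minimal sketch; batch_pb2 is a placeholder name for my generated module, and zlib stands in for the zipping step):

import zlib

from my_protos import batch_pb2  # generated from the .proto above (module name is a placeholder)

def build_payload(messages) -> bytes:
    """ Pack several protobuf messages of any type into one zipped BatchEntryPoint """
    batch = batch_pb2.BatchEntryPoint()
    for message in messages:
        entrypoint = batch.entrypoints.add()
        entrypoint.proto.Pack(message)         # google.protobuf.Any accepts any message type
        entrypoint.timestamp.GetCurrentTime()  # stamp when it was batched
    return zlib.compress(batch.SerializeToString())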
- Parsing the received messages.
Nothing fancy here: just unzipping and deserializing the message we just read from Pub/Sub to get human-readable data (see the sketch after this list).
- For loop on BatchEntryPoint to evaluate each EntryPoint message.
As each message in a BatchEntryPoint can have a different type, we need to process them differently.
- Parsing the message data.
Doing different processing to get all the information I need and formatting it into a BigQuery-readable format.
- Write data to BigQuery.
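To make steps 1 to 3 concrete, here is roughly what the parsing and dispatching look like (a simplified sketch: batch_pb2 / system_pb2 are placeholder names for my generated modules, zlib stands in for the unzipping, and format_system_for_bq hints at the per-type formatting):

import zlib

import apache_beam as beam

from my_protos import batch_pb2, system_pb2  # generated modules (placeholder names)

class ParseBatchEntryPoint(beam.DoFn):
    def process(self, payload: bytes):
        # Step 1: unzip and deserialize the Pub/Sub payload.
        batch = batch_pb2.BatchEntryPoint()
        batch.ParseFromString(zlib.decompress(payload))

        # Step 2: evaluate each EntryPoint of the batch.
        for entrypoint in batch.entrypoints:
            # Step 3: unpack the Any field according to its actual type,
            # then format it into a BigQuery-ready dict.
            if entrypoint.proto.Is(system_pb2.System.DESCRIPTOR):
                message = system_pb2.System()
                entrypoint.proto.Unpack(message)
                yield format_system_for_bq(message)  # per-type formatting, not shown
            # ... one branch per message type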
The last step, writing to BigQuery, is where my 'trouble' begins: my code works, but in my opinion it is very dirty and hardly maintainable.
There are two things to be aware of.
Each message type can be sent to 3 different datasets: an r&d dataset, a dev dataset and a production dataset.
Let's say I have a message named System. It could go to:
- my-project:rd_dataset.system
- my-project:dev_dataset.system
- my-project:prod_dataset.system
So this is what I am doing now:
console_records | 'Write to Console BQ' >> beam.io.WriteToBigQuery(
    lambda e: 'my-project:rd_dataset.table1' if dataset_is_rd_table1(e) else (
        'my-project:dev_dataset.table1' if dataset_is_dev_table1(e) else (
            'my-project:prod_dataset.table1' if dataset_is_prod_table1(e) else (
                'my-project:rd_dataset.table2' if dataset_is_rd_table2(e) else (
                    'my-project:dev_dataset.table2' if dataset_is_dev_table2(e) else (
                        ...) else 0  # one nested conditional per remaining type/env pair
I have more than 30 different types of messages, which makes more than 90 lines just for inserting data into BigQuery.
Here is what a dataset_is_..._tableX method looks like:
def dataset_is_rd_messagestype(element) -> bool:
    """ Check if env is rd for message's type message """
    valid: bool = False
    is_type = check_element_type(element, 'MessagesType')
    if is_type:
        valid = dataset_is_rd(element)
    return valid
check_element_type checks that the message has the right type (e.g. System).
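A minimal sketch of it, assuming the type is stored under the element's bq_type key:

def check_element_type(element, message_type: str) -> bool:
    """ Check that the element carries the expected message type (bq_type key is an assumption) """
    return element.get('bq_type') == message_type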
dataset_is_rd looks like this:
def dataset_is_rd(element) -> bool:
    """ Check if dataset should be RD from registry id """
    if element['device_registry_id'] == 'rd':
        del element['device_registry_id']
        del element['bq_type']
        return True
    return False
The element has a key indicating which dataset the message must be sent to.
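For illustration, an element arriving at this step looks roughly like this (values are made up):

element = {
    'device_registry_id': 'rd',  # routes to the r&d dataset
    'bq_type': 'System',         # routes to the system table
    # ... plus the actual fields to insert into BigQuery
}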
So this is working as expected, but I wish I had cleaner code, and maybe less code to change when adding or removing a type of message.
Any ideas?