
My input is a JSON list, and I want a PCollection with one element per item in that list. This is my code:

def parse_json(data):
    import json
    for i in json.loads(data):
        return i
data = (p
    | "Read text" >> beam.io.textio.ReadFromText(f'gs://{bucket_name}/not_processed/2020-06-08T23:59:59.999Z__rms004_m1__not_sent_msg.txt')
    | "Parse json" >> beam.Map(parse_json))

The problem is that I only get the first element, even when the list contains 2 elements.


How do I achieve this?

Abutreca

1 Answer


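I found out.

`beam.Map` emits exactly one output per input, and the `return` inside the loop exits on the first item, so only that item reaches the PCollection. Apache Beam has a transform called `ParDo` for this: when given a function that returns an iterable, it emits each item of that iterable as a separate element.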

def parse_json(data):
    import json
    # Return the whole list; ParDo emits each item as its own element.
    return json.loads(data)

data = (p
    | "Read text" >> beam.io.textio.ReadFromText(f'gs://{bucket_name}/not_processed/2020-06-08T23:59:59.999Z__rms004_m1__not_sent_msg.txt')
    | "Parse json" >> beam.ParDo(parse_json))
Abutreca
  • You can also use `FlatMap`, which is built on top of `ParDo`. You can find the difference between `Map` and `FlatMap` (`ParDo`) well explained [here](https://stackoverflow.com/a/45682977/7517757) – Tlaquetzal Jun 10 '20 at 15:19
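
The `FlatMap` variant mentioned in the comment could look roughly like the sketch below. It is not the original pipeline: `beam.Create` with inline sample lines stands in for `ReadFromText` so the example can run locally, `run` is a hypothetical wrapper name, and the sample data is made up for illustration.

```python
def parse_json(line):
    # Imported inside the function, as in the original post.
    import json
    # Return the full list; FlatMap emits each item as a separate element.
    return json.loads(line)


def run(lines):
    # Hypothetical wrapper: builds and runs a small local pipeline.
    # Beam is imported here so parse_json can be tried without Beam installed.
    import apache_beam as beam

    with beam.Pipeline() as p:
        (p
         | "Create lines" >> beam.Create(lines)         # assumption: replaces ReadFromText
         | "Parse json" >> beam.FlatMap(parse_json)     # one output per list item
         | "Print" >> beam.Map(print))


if __name__ == "__main__":
    # Made-up sample input: one text line holding a JSON list of two objects.
    run(['[{"id": 1}, {"id": 2}]'])
```

With `beam.Map(parse_json)` in place of `FlatMap`, the same input would produce a single element containing the whole list, which is why the choice of transform matters here.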