I am trying to make a dict from csv data in python, I do not want to use the traditional split(',') and then using renaming the rows to the heading I would like, as I will be recieving different csv files with different amounts of information, and I will not be able to consistently target the rows I want with that method.
THE HEADER NAMES WILL BE CONSISTENT, just their maybe more headers in one file compared to another
Instead, I have been trying to formulate a list from the CSV file, then zipping the first row into the rest of the rows to create a dictionary, then I can extract the exact contents I want.
I can create a list of lists, by either using the csv.reader or :
class Split(beam.DoFn):
def process(self, element):
rows = element.splitlines()
data = []
for row in rows:
data.append([row])
return data
This returns:
[u'FIRST_NAME,last_name,birthdate,voter_id,phone_number']
[u'hector,ABAD,6/15/1970,11*******,7*********']
[u'm,ABAL,6/16/1949,12********,']
[u'jorge,ABDALA,6/15/1962,21********,3********']
[u'karen,ABELLA,6/18/1988,33********,']
Although when I try to access the first row via:
rows = element.splitlines()
data = []
for row in rows:
# f = pattern.findall(row)
data.append([row])
return data[0]
It returns:
FIRST_NAME,last_name,birthdate,voter_id,phone_number
hector,ABAD,6/15/1970,11*******,7*********
m,ABAL,6/16/1949,109055849,
jorge,ABDALA,6/15/1962,21********,3********
karen,ABELLA,6/18/1988,33********,
I have also tried the beam_utils csv reader although this says that there is no module named 'sources' after I fix the fileio bug.
If someone knows a better way or can point me towards what I'm doing wrong that would be great, also this is my pipeline:
with beam.Pipeline(options=pipeline_options) as p:
(p
| 'Read' >> ReadFromText(known_args.input)
| 'Split Values' >> beam.ParDo(Split())
| 'WriteToText' >> beam.io.WriteToText(known_args.output))
I am only reading from my google-cloud storage bucket for now, but in the future it will be from pubsub.
I would like the content to look like:
{"FIRST_NAME": "hector", "last_name": "ABAD", "birthdate": "6/15/1970", "voter_id": 11*******, "phone_number": 7*********}
etc.
etc.
etc.