
A few records in my CSV contain special characters. Take employee data as an example, with the columns

    id, name, designation, address, salary

and rows such as

    1001, Peter Occon, Manager, "42, Willis Way St, Waterloo, Ohio, US", 5000

and so on.

As you can see, the 'address' column is quoted and contains commas. I need to remove those commas and quotes in Apache Beam.

Glarixon
  • Have you considered first splitting by `"` and then by `, `? This way you can parse the address first. – Iñigo Mar 21 '21 at 12:12

1 Answer


I achieved this with the following transform:

beam.Regex.replace_all(r'"([^"]*)"',lambda x:x.group(1).replace(',',''))

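NOTE - this transform has to come before the step that splits each line on commas in the pipeline; otherwise the commas inside the address have already been treated as field separators.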
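To see what the regex does to a single line, here is a minimal sketch using plain `re` outside of Beam. The helper name `strip_quoted_commas` is made up for illustration, and it assumes quoted fields never contain escaped quotes.

```python
import re

# Matches a double-quoted field and captures the text between the quotes.
QUOTED_FIELD = re.compile(r'"([^"]*)"')


def strip_quoted_commas(line):
    """Drop the quotes around each quoted field and the commas inside it."""
    return QUOTED_FIELD.sub(lambda m: m.group(1).replace(',', ''), line)


line = '1001, Peter Occon, Manager, "42, Willis Way St, Waterloo, Ohio, US", 5000'
cleaned = strip_quoted_commas(line)

# Only after cleaning is it safe to split on commas.
fields = [field.strip() for field in cleaned.split(',')]
print(fields)
```

After cleaning, the line splits into five fields, with the address reduced to a single comma-free string.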

Glarixon
  • You could also consider using https://stackoverflow.com/questions/15738918/splitting-a-csv-file-with-quotes-as-text-delimiter-using-string-split to properly handle double quotes and preserve the commas in your addresses. – robertwb Mar 22 '21 at 19:08
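
If keeping the commas in the address is acceptable, one way to follow the suggestion in the comment above in Python is the standard `csv` module, which understands quoted fields. This is only a sketch: `parse_csv_line` is a hypothetical helper name, and it assumes each pipeline element is one complete CSV line.

```python
import csv


def parse_csv_line(line):
    """Parse one CSV line into a list of fields, honouring double quotes."""
    # skipinitialspace=True ignores the space after each comma, so a quote
    # that follows ", " is still recognised as the start of a quoted field.
    return next(csv.reader([line], skipinitialspace=True))


line = '1001, Peter Occon, Manager, "42, Willis Way St, Waterloo, Ohio, US", 5000'
fields = parse_csv_line(line)
print(fields)
```

In a Beam pipeline a function like this could presumably be applied with `beam.Map(parse_csv_line)` in place of a plain split on commas, so the regex clean-up step would not be needed.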