0

The source provide the few rows with additional double quote in the CSV file.

Want to replace the additional double quotes with pipe (|) (e.g., for "G" and "Junior", below) which will be reverted to double quote after processing of the data. Numerical fields are without double quote and some values are with double quote. Blank values are defined in double quote.

1,"abc","Class (""G"") as agreed , with new Value x",7004,"rec"
2,"prq","Promoting ""Junior"" to Senior, showing good results","",x

Result should be:

1,"abc","Class (|G|) as agreed , with new Value x",7004,rec
2,"prq","Promoting |Junior| to Senior, showing good results","",x

We have to process CSV files through ADF pipeline using Copy command. Each file contains 1 mill+ records. SFTP --> ADLS --> database. We need to fix this extra double quote issue while processing the files from ADLS to database.

We are not using JSON or databricks or python. it's just a simple file load into database.

Tried below solutions -- The below link replaces comma within double quote but tried to change it as per requirement but no luck,

("[^",]+)""([^"]+")

regex to remove comma between double quotes notepad++.

Thank you in advance.

Zach Young
  • 10,137
  • 4
  • 32
  • 53
sachin
  • 1

2 Answers2

0

You should use a CSV aware tool. One of this is Miller.

Running

mlr --csvlite -N put '$3=gsub($3,"\"\"","|")' input.csv

you get

1,"abc","Class (|G|) as agreed , with new Value x",7004,"rec"
2,"prq","Promoting |Junior| to Senior, showing good results","",x

Some notes:

  • -N to set that is a no headging CSV;
  • put is the verb to run cells functios;
  • gsub the function to run global search and replace;
  • $3 to apply it to the third field.

Your CSV has some unnecessary quotes. It would be better to have

1,abc,"Class (|G|) as agreed , with new Value x",7004,rec
2,prq,"Promoting |Junior| to Senior, showing good results",,x

To get it using miller you can change format from csvlite to csv and run

mlr --csv -N put '$3=gsub($3,"\"","|")' input.csv
aborruso
  • 4,938
  • 3
  • 23
  • 40
  • Hi @Aborruso, As mentioned earlier, the fix need to be done during data load from ADLS to database while loading the data. Thank you anyway for the solution provided. – sachin Jun 26 '23 at 07:49
-1

Thank you for all your response and solutions provided.

The issue has been fixed by change in below settings at Copy activity of ADF pipeline.

Copy activity Settings change

karel
  • 5,489
  • 46
  • 45
  • 50
sachin
  • 1