I'm currently struggling to clean automatically generated CSV files whose fields contain both the CSV separator and the field delimiter, using sed, awk, or a script.

The source software has no settings to play with to improve the situation.

Format of the csv:

"111111";"text";"";"text with ; and " sometimes "; or ;" multiple times";"user";

Fortunately, the CSV is otherwise "well" formatted; the exporting software just doesn't escape or replace "forbidden" characters in the fields.

Over the last few days I tried to improve my knowledge of regular expressions and find an expression to clean the files, but I failed.

What I managed to do so far:

Regex to find the fields (I wanted to find the fields and perform a replacement inside them, but I didn't find a way to do it):

(?:";"|^")(.*?)(?=";"|";\n)

Regex that finds a semicolon inside a field. It does not work if the semicolon is the last character of the field, and it only finds one per field:

(?:^"|";")(?:.*?)(;)(?:[^"\n].*?)(?=";"|";\n)

Regex to find the double quotes. In online regex testers it seems to pick only the first double quote of the line:

(?:^"|";")(?:.*?)[^;](")(?:[^;].*?)(?=";"|";\n)

I thought of adding a space between every character in the fields, then searching for lone semicolons and double quotes, and removing the extra spaces afterwards, but I don't know if that's even possible, and it seems like a poor solution anyway.

Gaston M.

1 Answer

Any standard CSV library should be able to handle this as long as the CSV itself is valid. This is why quote characters and escape characters exist.

When you generate a CSV yourself, it is easy to forget to handle such cases and end up with an output file like this one. AWK is not a CSV reader, just a text-processing utility.

This is what your row should look like instead:

"111111";"text";"";"text with \; and \" sometimes \"; or ;\" multiple times";"user";

So if you can still re-fetch the data, find a way to export the CSV either through the database's own functionality or through a CSV library for the language you work with.

In Python, this would look like this:

mywriter = csv.writer(csvfile, delimiter=';', quotechar='"', escapechar="\\")
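To make the writer call above concrete, here is a minimal round-trip sketch. The helper names `write_rows` and `read_rows` are hypothetical, and the extra options `doublequote=False` and `quoting=csv.QUOTE_ALL` are assumptions added so that embedded quotes come out as `\"` and every field is quoted, as in the row shown earlier.

```python
import csv
import io

# Shared dialect options. doublequote=False is an assumption added here:
# with the default (True) the writer would emit "" instead of \" for
# quotes inside a field.
DIALECT = dict(delimiter=';', quotechar='"', escapechar='\\', doublequote=False)


def write_rows(rows):
    """Serialize rows to a CSV string with every field quoted."""
    buf = io.StringIO()
    writer = csv.writer(buf, quoting=csv.QUOTE_ALL, lineterminator='\n', **DIALECT)
    writer.writerows(rows)
    return buf.getvalue()


def read_rows(text):
    """Parse a CSV string written with the same dialect options."""
    return list(csv.reader(io.StringIO(text), **DIALECT))


row = ['111111', 'text', '',
       'text with ; and " sometimes "; or ;" multiple times', 'user']
text = write_rows([row])
print(text, end='')
print(read_rows(text) == [row])
```

The separator inside a quoted field does not need an escape of its own; only the embedded quote character does.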

But if you can't regenerate the CSV, the only hope is that you can rely on some pattern within the fields, as in this question: parse a csv file that contains commas in the fields with awk

But this is rarely true of textual data, especially comments or posts on a webpage. Another idea in such situations would be to use '\t' as the separator.
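If the export really cannot be changed, one pattern that might still hold is the one described in the question: every field is wrapped in double quotes and every line ends with `";`. The following is only a sketch under those assumptions, and it additionally assumes that the three-character sequence `";"` never occurs inside a field and that fields contain no line breaks. The names `split_unescaped_line` and `repair` are hypothetical.

```python
import csv
import io

FIELD_SEP = '";"'  # assumption: never appears inside a field's text


def split_unescaped_line(line):
    """Split one raw line into fields using the '";"' boundary."""
    line = line.rstrip('\r\n')
    if not (line.startswith('"') and line.endswith('";')):
        raise ValueError('line does not match the expected "...";"..."; layout')
    body = line[1:-2]  # drop the opening quote and the trailing '";'
    return body.split(FIELD_SEP)


def repair(lines, expected_fields=None):
    """Rewrite raw lines as CSV with backslash-escaped quotes."""
    out = io.StringIO()
    writer = csv.writer(out, delimiter=';', quotechar='"', escapechar='\\',
                        doublequote=False, quoting=csv.QUOTE_ALL,
                        lineterminator='\n')
    for number, line in enumerate(lines, 1):
        fields = split_unescaped_line(line)
        # A wrong field count means the '";"' assumption failed on this line.
        if expected_fields is not None and len(fields) != expected_fields:
            raise ValueError(f'line {number}: got {len(fields)} fields')
        writer.writerow(fields)
    return out.getvalue()


raw = '"111111";"text";"";"text with ; and " sometimes "; or ;" multiple times";"user";'
print(repair([raw], expected_fields=5), end='')
```

Passing the known column count as `expected_fields` flags lines where the assumption breaks, so they can be inspected by hand instead of being loaded silently. Fields that contain backslashes may need extra care before import.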

Aditya
  • Thanks for your answer. Sadly I can't change the output of the software and don't have access to its database. The fields are comments typed by users, so there is no pattern in them. I need to load this data into a MySQL DB; I use the "LOAD DATA INFILE" statement and it fails to handle those records. – Gaston M. Aug 08 '15 at 18:15
  • How did you get the data? You can definitely ask the person responsible for handing you the data to either use another delimiter such as '\t' or use an escape character to ensure proper formatting. You should simply inform them that the data is not parseable. – Aditya Aug 08 '15 at 18:22
  • There is absolutely no control over the CSV format, just over the rows and columns exported. I'll look more into the awk function. From your link I learned that you can use a regex as the field separator with awk; I thought only single characters were possible. Thanks again. – Gaston M. Aug 08 '15 at 18:27