0

I need to parse a csv, modify certain fields through a pandas dataframe, and then dump it again as csv. The fields are pretty odd and contain all sorts of characters, including the separator and all types of quotes. Example:

65324,1680334,13,"DATE: (at around 1 min) When 'Amelia Earhart' (qv) and 'George Palmer 
Putnam' (qv) are about to enter the railroad station, you can clearly see the \"no firearms 
permitted\" circle\\\\slash sign on the entrance door at the extreme right of the frame,
 just behind the Asian couple exiting the building - obviously not in keeping with the era.",

The things I need to do are:

  1. Inserting \N in certain fields when the field is empty (already done. In this example, the last field is empty and must be replaced).
  2. Shortening certain specific fields that are too long (also done. In this example, mthe 4th field must be shortened to 200 chars max)
  3. Making sure that nothing else breaks after parsing and dumping, and here is the issue.

This specific line, once parsed, edited the empty fields into \N, and dumped back to csv turns into this:

65324,1680334,13,"DATE: (at around 1 min) When 'Amelia Earhart' (qv) and 'George Palmer 
Putnam' (qv) are about to enter the railroad station, you can clearly see the ""no firearms
 permitted"" circle\\slash sign on the e",\N

(Both examples obviously do not actually have newlines, it is just for presentation purposes, imagine the whole thing being inline).

Issue: quotechar that is inside an already quoted field (around "no firearms permitted") is not escaped with the right escapechar (\), but rather doubled. Moreover, escapechars are not actually escaped (circle\\slash should be dumped as circle\\\\slash). If I explicitly pass the desired escapechar to df.to_csv(..., escapechar="\\"), all the '\N' are not recognized as special characters and their backslashes are also escaped, resulting in \\N (not good). How do I get away with escaping double quotes with \ and to NOT escape \Ns?

The simple solution would be to re-process the whole file with sed -i 's/\\\\N/\\N/g' file.csv or with pandas df.replace(...,...,regex=True), but I want to try and avoid that and do it directly without processing all the files once again as they are very large.

  • Because your file contains `/N` something that is not part of any csv standard, I wouldn't consider it a csv, but rather something csv-like. As such, perhaps using `to_csv()` isn't the right way to go. Perhaps writing the data out to a file row by row using `.apply()` and a function, so you can gain the level of control that you need, is the right path. Otherwise I would consider writing as csv and then a follow-up step to scrub your `/N` of the escape character. – JNevill Jun 16 '23 at 13:15
  • I found this answer https://stackoverflow.com/questions/2675323/mysql-load-null-values-from-csv-data that would allow me not to insert \N myself and therefore allow me to use escapechar='\\' – enrico_steez Jun 16 '23 at 13:33

0 Answers0