I need to parse a CSV file, modify certain fields through a pandas DataFrame, and then dump it back to CSV. The fields are pretty odd and contain all sorts of characters, including the separator and all types of quotes. Example:
65324,1680334,13,"DATE: (at around 1 min) When 'Amelia Earhart' (qv) and 'George Palmer
Putnam' (qv) are about to enter the railroad station, you can clearly see the \"no firearms
permitted\" circle\\\\slash sign on the entrance door at the extreme right of the frame,
just behind the Asian couple exiting the building - obviously not in keeping with the era.",
The things I need to do are:
- Inserting `\N` in certain fields when the field is empty (already done; in this example, the last field is empty and must be replaced).
- Shortening certain specific fields that are too long (also done; in this example, the 4th field must be shortened to 200 chars max).
- Making sure that nothing else breaks after parsing and dumping, and here is the issue.
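For reference, the first two steps can be sketched like this (a minimal sketch on an in-memory sample; the column names and the `io.StringIO` input are made up for illustration):

```python
import io

import pandas as pd

# In-memory stand-in for one row of the real file; column names are invented.
raw = '65324,1680334,13,"some long field text",\n'
df = pd.read_csv(io.StringIO(raw), header=None,
                 names=["id", "movie_id", "seq", "text", "note"])

# Step 1: empty fields are read as NaN; replace them with the literal string \N.
df = df.fillna(r"\N")

# Step 2: cap the 4th field at 200 characters.
df["text"] = df["text"].str.slice(0, 200)
```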
This specific line, once parsed, with the empty field edited to `\N`, and dumped back to CSV, turns into this:
65324,1680334,13,"DATE: (at around 1 min) When 'Amelia Earhart' (qv) and 'George Palmer
Putnam' (qv) are about to enter the railroad station, you can clearly see the ""no firearms
permitted"" circle\\slash sign on the e",\N
(Both examples obviously do not actually contain newlines; the line breaks are just for presentation purposes, so imagine the whole thing being inline.)
Issue: a quotechar inside an already quoted field (around `"no firearms permitted"`) is not escaped with the right escapechar (`\`), but doubled instead. Moreover, escapechars themselves are not escaped: `circle\\slash` should be dumped as `circle\\\\slash`. If I explicitly pass the desired escapechar with `df.to_csv(..., escapechar="\\")`, the `\N` values are no longer recognized as special and their backslashes get escaped too, resulting in `\\N` (not good).
How do I get away with escaping double quotes with \
and to NOT escape \N
s?
The simple solution would be to re-process the whole dumped file with `sed -i 's/\\\\N/\\N/g' file.csv` or with pandas `df.replace(..., ..., regex=True)`, but I want to avoid that and fix it directly at dump time, without processing all the files once again, as they are very large.
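For completeness, the post-processing workaround I'd like to avoid looks roughly like this (a sketch operating on a dumped string rather than a file; the sample content is made up):

```python
import re

# A dumped line where the \N marker was over-escaped to \\N.
dumped = '1,"some \\"text\\"",\\\\N\n'

# Equivalent of sed -i 's/\\\\N/\\N/g': collapse \\N back to \N.
fixed = re.sub(r'\\\\N', r'\\N', dumped)
```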