1

I have a CSV that looks like this:

"ID"|"Description"|"Date"
"1234"|"good text"|"2019-10-12"
"9012"|"bad"text|here""|"2018-01-12"

I am trying to write to a new CSV that looks like this:

+--------+------------------+--------------+
|  "ID"  |  "Description"   |    "Date"    |
+--------+------------------+--------------+
| "1234" | "good text"      | "2019-10-12" |
| "9012" | "bad text|here"  | "2018-01-12" |
+--------+------------------+--------------+

The issue is that the delimiter "|" is contained in one of the description values, namely "bad|text". So when I run something like:

df = pd.read_csv(csv_file, encoding='utf-16', sep='|')

I see:

pandas.errors.ParserError: Error tokenizing data. C error: Expected 3 fields in line 3, saw 4

The only resolution I see is to skip the rows containing the delimiter but obviously I want them:

Python Pandas Error tokenizing data

Any suggestions?

Note that file is encoded as UTF-16.

kjmerf
  • 4,275
  • 3
  • 21
  • 29

1 Answers1

1

Here is solution, the key engine='python-fwf', regex seperator and some cleaning.

import pandas as pd


df = pd.read_csv('file_in.csv', engine='python-fwf', sep='"|"', quotechar='"')
df.to_csv('file_out.csv', sep='|', index=False)

Output:

print(df)
     ID Description                 Date
0  1234   good text           2019-10-12
1  9012   bad" text  here ""|"2018-01-12

file_out.csv:

ID|Description|Date
1234|good text|2019-10-12
9012|"bad"" text"|"here """"|""2018-01-12"
Quant Christo
  • 1,275
  • 9
  • 23
  • This one resulted in: pandas.errors.ParserError: Expected 7 fields in line 3, saw 9. Error could possibly be due to quotes being ignored when a multi-char delimiter is used. – kjmerf Oct 21 '19 at 20:03
  • @kjmerf sorry my bad, I had slightly different csv. Check `engine='python-fwf'`. Result is closer but not exact yet. – Quant Christo Oct 21 '19 at 20:11
  • Still not exactly right... Run df.columns and you see you only have ID in one column and the other is Description|Date – kjmerf Oct 21 '19 at 20:36
  • I got: `Index(['ID', 'Description', 'Date'], dtype='object')` Do you use `sep='"|"'`? – Quant Christo Oct 21 '19 at 20:44
  • 1
    Yes but remember the file is UTF-16. If you are testing, save a file as UTF 16 and then try... – kjmerf Oct 21 '19 at 20:51
  • Quick idea, if `Description` and `Date` are merged into one string column, you can get `Date` by stripping last 10 characters into separate column. Not clean solution but entry data is also not clean ;) – Quant Christo Oct 21 '19 at 21:03