
I am new to Python and am trying to add headers to the data read from a text file, but I keep getting the following error:

pandas.errors.ParserError: Error tokenizing data. C error: Expected 3 fields in line 4, saw 4.

After some research, I found that this happens because dollar amounts in the thousands contain a comma, which pandas treats as a field separator, so it sees an extra field that is not really there. I would like to remove the comma from the dollar amount so that adding the headers no longer raises the error above. Any help is greatly appreciated.

Text file information:

19140E020603,$0.00,payment not received

13141W000119,$0.00,payment not received


99141V009055,$3,468.07,payment received

29141N005785,$0.00,payment not received

79141M009249,$3,468.07,payment received

15141Q005785,$127.73,payment received

Expected result:

id     Amount     Comments

19140E020603,$0.00,payment not received

13141W000119,$0.00,payment not received

99141V009055,$3,468.07,payment received

29141N005785,$0.00,payment not received

79141M009249,$3,468.07,payment received

15141Q005785,$127.73,payment received

Code:

import pandas as pd

fhand = open("results.txt")
result = pd.read_csv(fhand)
result.columns = ["id","Amount","Comments"]
print(result)
nickyfot
  • Why not pipe-delimit the file instead of using commas, or quote each comma-delimited field? One of the following: 19140E020603|$0.00|payment not received or "19140E020603","$0.00","payment not received" – Adrianopolis Nov 01 '19 at 17:57
  • @AntonvBR In my opinion your duplicate is about something different: the CSV there has proper quotation marks around the questionable items, so importing the _structure_ is not the problem, only how to deal with commas and dollar signs in some columns. Here the problem is that the import itself already fails, because pandas cannot detect a constant number of columns throughout the whole file. – SpghttCd Nov 01 '19 at 18:51
  • I'd rather give this as an answer, as I could then do better formatting and further explanation, but in my opinion you should simply take advantage of a regex separator, which matches only commas not followed by a number: `pd.read_csv(filename, sep=',(?![0-9])', header=None, engine='python')` – SpghttCd Nov 01 '19 at 20:37
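
The regex-separator idea from the last comment could look roughly like the sketch below. It is only an illustration: the sample rows are inlined through io.StringIO instead of reading results.txt, dtype=str is my addition (an id such as 19140E020603 looks like scientific notation), and it assumes a comment never starts with a digit, because the pattern only splits on commas that are not followed by one.

```python
import io
import pandas as pd

# Sample rows copied from the question; in practice pass the path "results.txt".
sample = """19140E020603,$0.00,payment not received
99141V009055,$3,468.07,payment received
15141Q005785,$127.73,payment received
"""

# Split on a comma only when the next character is not a digit, so the
# thousands separator inside "$3,468.07" is left alone.
# A regex separator needs the (slower) Python parsing engine.
result = pd.read_csv(
    io.StringIO(sample),
    sep=r",(?![0-9])",
    header=None,                        # the file has no header row
    names=["id", "Amount", "Comments"],
    engine="python",
    dtype=str,                          # assumption: keep every column as text
)
print(result)
```

The Amount column still holds the original text, comma included; only the splitting has changed.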

3 Answers


Start by reading the source file with read_csv, but with a separator that does not occur in the data (I chose a vertical bar):

df = pd.read_csv('Input.csv', sep='|', names=['row'])

This way each row is read as a single field.

Then extract each actual field using regex:

df.row.str.extract(r'(?P<id>[^,]+),(?P<Amount>\$[\d,]+\.\d{2}),(?P<Comments>.+)')

The key to success is a properly formulated regex. This one contains three named capturing groups, one per field, and the group names become the column names of the result.

If something is unclear, search the Web for a regex tutorial.
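
Put together, the two steps above might look like the following sketch. The sample rows are inlined via io.StringIO rather than read from Input.csv, and the extra AmountValue column is a hypothetical addition of mine to show how the text could be turned into a number afterwards.

```python
import io
import pandas as pd

# Sample rows copied from the question; in practice pass the real file path.
sample = """19140E020603,$0.00,payment not received
99141V009055,$3,468.07,payment received
15141Q005785,$127.73,payment received
"""

# "|" never occurs in the data, so each line is read as one string field.
df = pd.read_csv(io.StringIO(sample), sep="|", header=None, names=["row"], dtype=str)

# One named group per field; the group names become the column names.
pattern = r"(?P<id>[^,]+),(?P<Amount>\$[\d,]+\.\d{2}),(?P<Comments>.+)"
result = df["row"].str.extract(pattern)

# Optional (hypothetical column name): a numeric copy of the amount,
# obtained by dropping "$" and "," before converting to float.
result["AmountValue"] = (
    result["Amount"].str.replace(r"[$,]", "", regex=True).astype(float)
)
print(result)
```

Rows that do not match the pattern come back as NaN in every extracted column, so it is worth checking for those on real data.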

Valdi_Bo

Try this:

df = pd.read_csv("results.txt", sep=",", header=None, names=["id", "Amount", "Comments"])

Further discussion here: Load data from txt with pandas

ggorlen
kalidurge

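You could use something like .str.replace(",", "") to strip the comma from the Amount values once the file has been read. The read itself has to keep each amount in a single field first, for example by splitting only on commas that are not followed by a digit:

import pandas as pd

# Split on commas not followed by a digit, so "$3,468.07" stays in one field
result = pd.read_csv("results.txt", sep=r",(?![0-9])", header=None, engine="python", dtype=str)
result.columns = ["id", "Amount", "Comments"]
# Remove the thousands separator from the Amount strings
result["Amount"] = result["Amount"].str.replace(",", "", regex=False)
print(result)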
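
If you would rather remove the commas before pandas sees the data at all, a plain-Python pass over the lines is another option. This is only a sketch under the assumption that the id and the comment never contain a comma themselves; split_record is a hypothetical helper name, and the lines are inlined here instead of being read from results.txt.

```python
import pandas as pd

def split_record(line):
    """Split 'id,amount,comment' where only the amount may contain commas."""
    record_id, rest = line.split(",", 1)       # the first comma ends the id
    amount, comment = rest.rsplit(",", 1)      # the last comma starts the comment
    return record_id, amount.replace(",", ""), comment

# Sample lines copied from the question, including a blank line like the
# ones in the file; in practice iterate over open("results.txt") instead.
lines = [
    "19140E020603,$0.00,payment not received\n",
    "\n",
    "99141V009055,$3,468.07,payment received\n",
    "15141Q005785,$127.73,payment received\n",
]

# Skip blank lines, then build the frame with the wanted headers.
rows = [split_record(line.strip()) for line in lines if line.strip()]
result = pd.DataFrame(rows, columns=["id", "Amount", "Comments"])
print(result)
```

Because the frame is built from Python strings, the ids stay text and there is no separator guessing involved.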
Jortega