0

I am trying to load a semicolon seperated txt file and there are a few instances where escape chars are in the data. These are typically &lt ; (space removed so it isn't covered to <) which adds a semicolon. This obviously messes up my data and since dtypes are important causes read_csv problems. Is there away to tell pandas to ignore these when the file is read?

I tried deleting the char from the file and it works now, but given that I want an automated process on millions of rows this is not sustainable.

df = pd.read_csv(file_loc.csv,
                 header=None, 
                 names=column_names, 
                 usecols=counters, 
                 dtype=dtypes,
                 delimiter=';', 
                 low_memory=False)
ValueError: could not convert string to float:

As my first column is a string and the second is a float, but if the first is split by the &lt ; it then goes on the 2nd too.

Is there a way to tell pandas to ignore these or efficiently remove before loading?

DarknessFalls
  • 111
  • 2
  • 5
  • 12
  • You could try experimenting with the `escapechar=` parameter, but it would really help to have a reproducible example. See [How to make good pandas examples](https://stackoverflow.com/questions/20109391/how-to-make-good-reproducible-pandas-examples) – G. Anderson Aug 30 '19 at 19:44

1 Answers1

1

Give the following example csv file so57732330.csv:

col1;col2
1&lt;2;a
3;

we read it using StringIO after unescaping named and numeric html5 character references:

import pandas as pd
import io
import html

with open('so57732330.csv') as f:
    s = f.read()
f = io.StringIO(html.unescape(s))
df = pd.read_csv(f,sep=';')

Result:

  col1 col2
0  1<2    a
1    3  NaN
Stef
  • 28,728
  • 2
  • 24
  • 52
  • Thanks that solves it! I have a few junk strings that our system sometimes outputs that I would like to automatically remove. Can I use stingio do to handle those too? – DarknessFalls Aug 31 '19 at 19:12
  • 1
    yes, you can remove them from `s` and then pass `s` to StringIO (or StringIO(html.unescape(s)) as in the example above). – Stef Aug 31 '19 at 19:32
  • To do that would I have to loop over every row in the csv? Tried with a for row in reader, line with a prior line set as csv.reader(s) but that's a little heavy. – DarknessFalls Aug 31 '19 at 21:55
  • depending on how these junk strings look like and the size of your csv it may be easier to pre-process the csv with `sed` and only then read it with your python program – Stef Aug 31 '19 at 22:05