I have a CSV file encoded in ANSI which I'm processing with Python pandas on a non-ANSI machine. The resulting dataframe (`df1`) has some garbage in it:

Expirydate      food     color
20150713        banana   yellow
20150714        steak    brown
???             ???(g?0) ???

I am trying to remove the 'garbage' line using this:

df1[df1.Expirydate.str.contains("?")==False]

but am getting this error:

sre_constants.error: nothing to repeat

Can anybody help? It would be most appreciated!

qts
  • Can you post example of garbage values – sinhayash Jul 14 '15 at 07:57
  • a non-ansi machine? Python can read ansi, just load the csv data with `pandas.read_csv('filename', encoding='ansi')` or use python3, which solves all encoding problems automagically – firelynx Jul 14 '15 at 08:05
  • I just tried that but got the following error `unknown encoding: ansi`. Also read [here](http://stackoverflow.com/questions/22279413/python-convert-encodinglookuperror-unknown-encoding-ansi) that there is no ansi encoding in standard encodings. :-( – qts Jul 14 '15 at 08:16
  • You're right. But according to http://stackoverflow.com/questions/700187/unicode-utf-ascii-ansi-format-differences you are probably looking for an existing encoding, as ansi can mean many things, but your data is definitely saved with a certain encoding. The link suggests trying 'cp1252' – firelynx Jul 14 '15 at 09:18
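
To illustrate the encoding suggestion from the comments, here is a minimal sketch that assumes the file really was saved as cp1252 (the usual meaning of "ANSI" on Western European Windows, but only an assumption here). The sample bytes and the `crème` row are made up for the demonstration; a real file path would replace the `io.BytesIO` object.

```python
import io

import pandas as pd

# Hypothetical sample: bytes as a Windows tool might write them in cp1252.
raw = (
    "Expirydate,food,color\n"
    "20150713,banana,yellow\n"
    "20150715,crème,beige\n"
).encode("cp1252")

# Passing the encoding explicitly avoids relying on the platform default.
# dtype=str keeps Expirydate as text so the .str accessor works later.
df1 = pd.read_csv(io.BytesIO(raw), encoding="cp1252", dtype={"Expirydate": str})
print(df1)
```

If the garbage row persists with the right encoding, the problem is in the data itself rather than in the decoding step.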

1 Answer

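`str.contains` treats its pattern as a regular expression by default. A bare `?` is a quantifier with nothing before it to repeat, which is what the error message is saying. To match a literal `?` in the content, you can escape it:

df1[~df1.Expirydate.str.contains(r'\?')]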
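
As a rough sketch of why the error occurs and two ways around it, the snippet below rebuilds the sample frame from the question (assuming the garbage really consists of literal `?` characters, which may not be true of the actual file) and filters it both with an escaped pattern and with `regex=False`, which skips the regex engine entirely.

```python
import re

import pandas as pd

# Sample frame copied from the question; the "?" characters are assumed literal.
df1 = pd.DataFrame({
    "Expirydate": ["20150713", "20150714", "???"],
    "food": ["banana", "steak", "???(g?0)"],
    "color": ["yellow", "brown", "???"],
})

# A bare "?" is a quantifier with nothing to repeat, so compiling it fails.
try:
    re.compile("?")
    bare_pattern_ok = True
except re.error:
    bare_pattern_ok = False

# Option 1: escape the metacharacter (a raw string avoids invalid-escape warnings).
escaped = df1[~df1["Expirydate"].str.contains(r"\?")]

# Option 2: treat the pattern as a plain substring instead of a regex.
literal = df1[~df1["Expirydate"].str.contains("?", regex=False)]

print(escaped)
```

Both options give the same result here; `regex=False` is the simpler choice when no regex features are needed.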
YS-L
  • upvoted this answer :-) however it seems that it is still not able to pick up the 'garbage' line. I'm not sure whether it's because it can't handle regular expressions, even though escaped. Eventually I got around it by doing this `df1[~df1.Expirydate.str.contains('2015')]`. – qts Jul 14 '15 at 09:07
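
One possible reason the escaped pattern finds nothing is that the cells do not contain real `?` characters but undecodable bytes that are merely displayed as `?`. That is a guess, not something the thread confirms. Under that assumption, a more robust filter is to keep only the rows whose `Expirydate` parses as a date, sketched below with U+FFFD replacement characters standing in for the garbage. Note also that `~` inverts the mask, so the expression in the comment above, as written, selects the rows that do not contain `2015`.

```python
import pandas as pd

# Hypothetical frame: the garbage is modelled as U+FFFD replacement characters.
df1 = pd.DataFrame({
    "Expirydate": ["20150713", "20150714", "\ufffd\ufffd\ufffd"],
    "food": ["banana", "steak", "\ufffd\ufffd\ufffd"],
    "color": ["yellow", "brown", "\ufffd\ufffd\ufffd"],
})

# Anything that is not a valid YYYYMMDD date becomes NaT instead of raising.
parsed = pd.to_datetime(df1["Expirydate"], format="%Y%m%d", errors="coerce")

# Keep only the rows with a valid date.
clean = df1[parsed.notna()]
print(clean)
```

This validates what a good row looks like instead of guessing what the garbage looks like, so it does not depend on the exact junk characters.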