I am trying to read in a CSV file containing text with em dashes and/or en dashes (not to be confused with the regular hyphen/minus sign). Every time I read in this CSV, the dashes are turned into the replacement character (�). If I try to encode or decode the file, I just get error messages saying that UTF-8 doesn't recognize the dashes. Should I just write the CSV file from Python instead? This seems like a really dumb problem that should be easy to fix.

My code is:

import pandas as pd

df = pd.read_csv('csv file with em dash or en dash')
print(df)

My output is:

col_name
� �

I have tried replacing the dashes after the file has been read in, but that isn't working. I have also tried replacing the replacement character, but that hasn't worked either. My ideal solution would be for the dashes to show up exactly as they are in the CSV file. I think it has something to do with how the file is being read into Python, but whenever I try an encoder/decoder, I just get errors that the dashes aren't supported.

  • 2
    python2 or python3? What happens if you write `print(u"\u2014")`? Is the dash output correctly? In case you are on Windows, do you know about chcp? See https://superuser.com/questions/269818/change-default-code-page-of-windows-console-to-utf-8 – quant Nov 15 '18 at 21:47
  • https://stackoverflow.com/questions/33307690/python-ascii-codec-cant-encode-en-dash – Xogle Nov 15 '18 at 21:47
  • You need to determine the actual encoding of the file; it seems it's *not* UTF-8. – Mark Ransom Nov 15 '18 at 21:50
  • This is in python 2.7. – not_a_comp_scientist Nov 15 '18 at 22:07
  • @quant when I print(u"\u2014") the output is the em dash, which makes it even odder that the csv is not reading properly. – not_a_comp_scientist Nov 15 '18 at 22:09
  • 1
    @mgh5021 No, because it is python 2.7 and python 2.7 internal default encoding is not UTF-8! But at least the output of the character is already working correctly - which is not always the case ... – quant Nov 15 '18 at 22:12
  • @mgh5021 Is the character output correctly if you execute a script with `import codecs` followed by `with codecs.open("yourFile.csv", "r", "UTF-8") as inF: print(inF.read())`? – quant Nov 15 '18 at 22:17
  • 1
    I will try that just to see, but if I use df = pd.read_csv('csv file', encoding='cp1252'), it brings in the dashes. – not_a_comp_scientist Nov 15 '18 at 22:25
  • So you got it working, right? That file is not UTF8 but simply uses the default Western Windows codepage. – Jongware Nov 16 '18 at 22:39
  • Yes, I got it working. I just used encoding = 'cp1252' and it brought in the dashes. – not_a_comp_scientist Nov 19 '18 at 16:31
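
For anyone landing here later, the fix from the comments can be illustrated with a small standard-library sketch. It is only an illustration under stated assumptions: the helper name `read_text_with_fallback`, the candidate encoding list, and the sample file are all made up for this example and are not from the original post. The idea is that an em dash or en dash saved by a Windows tool is often a single cp1252 byte (0x97 or 0x96), which is not valid UTF-8, so decoding as UTF-8 fails while cp1252 succeeds.

```python
# -*- coding: utf-8 -*-
import io
import os
import tempfile


def read_text_with_fallback(path, encodings=("utf-8", "cp1252")):
    """Hypothetical helper: return (text, encoding) for the first
    candidate encoding that decodes the whole file without error."""
    with io.open(path, "rb") as f:
        raw = f.read()
    for enc in encodings:
        try:
            return raw.decode(enc), enc
        except UnicodeDecodeError:
            continue
    raise ValueError("no candidate encoding matched: %r" % (encodings,))


# Assumption: a sample file saved in cp1252, where the em dash is the
# single byte 0x97 and the en dash is 0x96.
sample = u"col_name\n\u2014 \u2013\n"
fd, sample_path = tempfile.mkstemp(suffix=".csv")
with os.fdopen(fd, "wb") as f:
    f.write(sample.encode("cp1252"))

text, used = read_text_with_fallback(sample_path)
os.remove(sample_path)

print(used)
print(repr(text))
```

Once the encoding is known, it can be passed straight to pandas, as in the comment above: pd.read_csv('csv file', encoding='cp1252'). Note that guessing by trial is a heuristic; cp1252 will decode almost any byte sequence, so it should come last in the candidate list.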

0 Answers