I'm working on a project with a dataset coming from Board Game Geek.

The issue concerns the names of the games I'm studying. I think something went wrong with the encoding, so the CSV file I received contains escaped letters. For example: Orl\u00e9ans instead of Orléans.

When I import the CSV in Python, the names stay like that, and I want to correct these letters.

I think I found the right function:

>>> unicodedata.normalize("NFD", 'Orl\u00e9ans')
'Orléans'

The problem is that I can't apply this function in a for loop.
The string is displayed as 'Orl\u00e9ans', but it actually contains 'Orl\\u00e9ans' (a literal backslash followed by u00e9), so the function has nothing to convert.
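
To make the difference concrete, here is a small sketch comparing the two strings. The variable names `literal` and `escaped` are only for illustration:

```python
import unicodedata

literal = 'Orl\u00e9ans'    # the escape is interpreted when Python parses the literal
escaped = 'Orl\\u00e9ans'   # what the CSV holds: a backslash, then 'u', '0', '0', 'e', '9'

print(len(literal))   # 7
print(len(escaped))   # 12

# normalize() only recomposes or decomposes existing characters;
# it does not interpret backslash sequences, so the string is unchanged.
print(unicodedata.normalize("NFD", escaped) == escaped)   # True
```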

Is there any way to correct this? I have 20,000 entries in the dataset, so I can't fix them one by one.
Thank you.

EDIT: I found the answer in this post: Process escape sequences in a string in Python

>>> myString = "spam\\neggs"
>>> decoded_string = bytes(myString, "utf-8").decode("unicode_escape") # python3 
>>> decoded_string = myString.decode('string_escape') # python2
>>> print(decoded_string)
spam
eggs
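
One caveat with the Python 3 line above: `unicode_escape` reads the bytes as Latin-1, so any accented character that is already correct in the string would come out garbled. A minimal sketch of an alternative, assuming only `\uXXXX` sequences need fixing, is a regular-expression replacement. The helper name `decode_u_escapes` and the sample names are made up for illustration, and the sketch does not handle surrogate pairs or other escapes such as `\n`:

```python
import re

# Matches a literal backslash, the letter 'u', and exactly four hex digits.
_ESCAPE = re.compile(r'\\u([0-9a-fA-F]{4})')

def decode_u_escapes(text):
    """Replace literal backslash-u escapes (four hex digits) with their characters."""
    return _ESCAPE.sub(lambda m: chr(int(m.group(1), 16)), text)

# Made-up sample: one escaped name, one plain name, one mixing both forms.
names = ['Orl\\u00e9ans', 'Catan', 'Café \\u00e0 la carte']
fixed = [decode_u_escapes(n) for n in names]
print(fixed)   # ['Orléans', 'Catan', 'Café à la carte']
```

The same list comprehension could be run over the name column of the full dataset, whatever that column is called.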

Thanks a lot

  • FYI, `unicodedata.normalize` does nothing here. Try simply `print('Orl\u00e9ans')`. The escape sequence is already being interpreted by Python while parsing the string literal. – deceze Apr 15 '21 at 08:44
  • I would guess the data was originally JSON encoded…? Unicode escape sequences in JSON should be decoded to normal characters when properly JSON-parsing them, so that should be of zero concern. Is this something you must fix after the fact now, or could you simply fix how you get and treat the original data and recreate your CSVs…? – deceze Apr 15 '21 at 08:46
  • The issue is that my string contains `'Orl\\u00e9ans'` and not `'Orl\u00e9ans'`. I can't change the CSV file: it was given to us to study in a lesson, so I have to correct it as it is. It's not actually mandatory, though; our instructions are to do whatever we want with this file and try to make use of it. – Romain Kerdoncuff Apr 15 '21 at 09:07
  • OK, then you'll have to open the file with the `unicode-escape` encoding or decode it as such after the fact… – deceze Apr 15 '21 at 09:09
  • I tried to use `.encode('utf-8', 'unicode-escape')`, but the bytes object I get still has the escaped backslash (\\). I might have done it wrong; I'm not very used to encoding and decoding. I think I just need to replace '\\' with '\' in my string so that Python knows this backslash starts an escape sequence, but I can't find a way to do it. – Romain Kerdoncuff Apr 15 '21 at 09:25
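
A short sketch of why the `.encode(...)` attempt in the last comment leaves the backslash in place. The second positional argument of `str.encode` is the error handler, not a second codec, and encoding only converts text to bytes. The interpretation of the sequence happens on decode:

```python
escaped = 'Orl\\u00e9ans'

# Encoding turns the text into bytes; the backslash sequence is left as is.
as_bytes = escaped.encode('utf-8')
print(as_bytes)   # b'Orl\\u00e9ans'

# Decoding those bytes with the unicode_escape codec interprets the sequence.
decoded = as_bytes.decode('unicode_escape')
print(decoded)    # Orléans
```

This works here because the string is pure ASCII; text that already contains non-ASCII characters would need a different approach.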

1 Answer

I would try to use latin1 encoding as follows:

import codecs

# Open the CSV as latin1 text and read its contents
with codecs.open(r'$(path to your csv file)', encoding='latin1') as f:
    text = f.read()
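
Note that the choice of `latin1` only controls how bytes are mapped to characters. If the file contains the literal text `\u00e9`, it is still there after reading, so a decoding step would still be needed. A minimal sketch, using an in-memory stand-in for the file (the sample content is made up):

```python
import io

# Stand-in for the CSV file: ASCII bytes containing a literal backslash sequence.
raw = io.BytesIO(b'name\nOrl\\u00e9ans\n')

text = io.TextIOWrapper(raw, encoding='latin1').read()
print('\\u00e9' in text)   # True: the sequence survives the latin1 read

# Round-trip through latin1 bytes, then let unicode_escape interpret the sequence.
fixed = text.encode('latin1').decode('unicode_escape')
print(fixed)
```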