I have a column in a CSV file that contains escaped Unicode values (\x sequences) written as normal text. I am trying to use the following code (not mine) to decode this text, but it throws a syntax error.

with open("fixed_datasetssscopy.csv", "r") as fp:
    file_buffer = io.StringIO()
    for line in fp.read().splitlines():
        file_buffer.write(eval('''b"{}".decode('utf-8')'''.format(line)))
        file_buffer.write('\n')
    file_buffer.seek(0)
df = pandas.from_csv(file_buffer)

When I look at the entries that throw the errors, they are enclosed in double quotes ("") when I print them in my IDE, even though in the CSV file itself they are not. Some examples of entries that cause the error are below.

ER...in the end it's a job. So, fair dos. https:/asdasd
When i started using Gutenberg like a month ago, I didn't care for the workflow but now it makes it easy to do thin\xe2\x80\xa6 https:/asdasd

The actual error message is:

Traceback (most recent call last):
  File "C:/Users", line 8, in <module>
    file_buffer.write(eval('''b"{}".decode('utf-8')'''.format(line)))
  File "<string>", line 1
    b""ER...in the end it's a job. So, fair dos. https://u",,,,,,,,,".decode('utf-8')
    ^
SyntaxError: invalid syntax

How can I fix this error?

dmnte
  • Why are you using `eval()` like that instead of just decoding `line` directly and passing the result to the `write()` method? – martineau Nov 05 '18 at 07:31
  • There was no need to use eval. But when I tried to run his code I was still getting an error because of the \" escape character in line. I solved it by first encoding line and then decoding it: file_buffer.write(line.encode('utf-8').decode('utf-8')) – Vedant Shetty Nov 05 '18 at 08:06

1 Answer

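You are getting the error because of the double quotes (") in your input string. Once a line containing " is substituted into the b"{}" template, the quote closes the bytes literal early and what follows is no longer valid Python. I made some changes to your code to get it to work.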
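To see why the quotes matter, here is a minimal sketch that only compiles the generated expression instead of evaluating it. The helper name builds_valid_expression and the two sample lines are made up for illustration; the template is the one from the question.

```python
# Same template the question passes to eval().
TEMPLATE = '''b"{}".decode('utf-8')'''


def builds_valid_expression(line):
    """Return True if substituting `line` into TEMPLATE yields parseable Python.

    Hypothetical helper: it compiles the expression but never runs it.
    """
    try:
        compile(TEMPLATE.format(line), "<string>", "eval")
        return True
    except SyntaxError:
        return False


# A line that arrives wrapped in double quotes, like the failing entry.
quoted = '"ER...in the end it\'s a job. So, fair dos.",,,'
# A line with escapes but no double quotes.
unquoted = r"easy to do thin\xe2\x80\xa6"

print(builds_valid_expression(quoted))
print(builds_valid_expression(unquoted))
```

The quoted line produces b""ER..., which is the same broken source shown in the traceback, while the unquoted line compiles cleanly.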

Quick Fix

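import io
import pandas

with open("fixed_datasetssscopy.csv", "r") as fp:
    file_buffer = io.StringIO()
    for line in fp.read().splitlines():
        # Interpret backslash escapes such as \xe2 that are written out as plain text
        file_buffer.write(bytes(line, "utf-8").decode("unicode_escape"))
        file_buffer.write('\n')
    file_buffer.seek(0)
# read_csv replaces the deprecated DataFrame.from_csv
df = pandas.read_csv(file_buffer)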

Another issue in your code is that you have used pandas.from_csv, which does not exist. The method was pandas.DataFrame.from_csv.

It is recommended that you use pandas.read_csv instead of from_csv, because DataFrame.from_csv is deprecated (see here) and has been removed in later pandas releases.

pandas.read_csv is also much faster than from_csv. You can find the documentation for read_csv here.

Longer (Better) Solution

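The above solution does not work if your input string has actual Unicode (non-ASCII) characters, because unicode_escape decodes the bytes as Latin-1 rather than UTF-8. In your example, a character like ’ comes out as â followed by stray characters.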

You can go ahead with the above solution if you are certain that all characters in your input are ASCII text.
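For the ASCII-only case, a minimal sketch of one way to turn \xNN escapes that spell out UTF-8 bytes into the intended characters is shown below. It is an assumption on my part that your escapes are UTF-8 byte values (the \xe2\x80\xa6 sample suggests so); the function name decode_utf8_byte_escapes is made up for illustration.

```python
def decode_utf8_byte_escapes(text):
    """Decode \\xNN escapes that represent UTF-8 bytes.

    Assumes `text` is pure ASCII and every escape is a \\xNN byte value.
    """
    # Step 1: unicode_escape turns each \xNN into the code point U+00NN.
    as_code_points = text.encode("ascii").decode("unicode_escape")
    # Step 2: latin-1 maps U+00NN back to the single byte 0xNN,
    # and those bytes can then be decoded as UTF-8.
    return as_code_points.encode("latin-1").decode("utf-8")


raw = r"easy to do thin\xe2\x80\xa6"

# The quick fix alone stops after step 1 and leaves mojibake.
naive = bytes(raw, "utf-8").decode("unicode_escape")
fixed = decode_utf8_byte_escapes(raw)

print(repr(naive))
print(repr(fixed))
```

The extra Latin-1 round trip is what separates an ellipsis from the â-style output. It raises an error if the text already contains non-ASCII characters, so it is only suitable when the input really is plain ASCII.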

If you have non-ASCII characters in your input as well, then what you can do is find the "\" escape sequences in your input and replace only those, leaving the rest of the text untouched.

This has already been done by rspeer here.
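The linked code is not reproduced here. As a rough sketch in the same spirit (my own, not rspeer's), the following replaces only runs of \xNN escapes and decodes each run as UTF-8, so real non-ASCII characters elsewhere in the line are left alone. The names HEX_RUN, _decode_run and decode_hex_escapes are hypothetical, and treating the escapes as UTF-8 bytes is an assumption based on the sample data.

```python
import re

# One or more consecutive \xNN escapes written out as literal text.
HEX_RUN = re.compile(r"(?:\\x[0-9a-fA-F]{2})+")
HEX_PAIR = re.compile(r"\\x([0-9a-fA-F]{2})")


def _decode_run(match):
    """Convert one run of \\xNN escapes to bytes, then decode as UTF-8."""
    raw = bytes(int(pair, 16) for pair in HEX_PAIR.findall(match.group(0)))
    # Invalid byte sequences become U+FFFD instead of raising.
    return raw.decode("utf-8", errors="replace")


def decode_hex_escapes(text):
    """Replace every run of \\xNN escapes in `text`, leaving the rest as-is."""
    return HEX_RUN.sub(_decode_run, text)


example = r"don\xe2\x80\x99t order the café special"
print(decode_hex_escapes(example))
```

Because only the matched runs are touched, this avoids the Latin-1 side effect of calling unicode_escape on the whole line. Other escape forms such as \uNNNN or \n are not handled by this sketch.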

Vedant Shetty
  • Thanks for your reply. It's getting past that point now but is throwing an error on the last line: AttributeError: module 'pandas' has no attribute 'from_csv' – dmnte Nov 05 '18 at 08:42
  • This code runs, but it does not fix the original problem I had of decoding the text and displaying the Unicode. The data frame that is being created has the same entries containing \x values that aren't being decoded. – dmnte Nov 05 '18 at 09:22
  • Made changes to the code so that it works for you now, and have also provided a link to rspeer's code. Take a look at the regex function he has created if my above solution doesn't work for you. – Vedant Shetty Nov 05 '18 at 10:36
  • Thanks again for your reply. I tried something similar previously and got a similar result. When using unicode_escape the characters are decoded, but to some different character set it seems. For example, instead of don't, it decodes it to donât, and similarly weird stuff for other Unicode. Other than this it runs fine. – dmnte Nov 05 '18 at 14:03
  • What the ' is decoded to is rendered differently in the browser and in the CSV: here it is just a hat, while in the URL bar it is a hat and a trademark symbol, and it is different again in the CSV. I tried using the example by rspeer, which I was able to do, but surprisingly it decodes everything to the same values as the code here. – dmnte Nov 05 '18 at 14:45
  • Sorry, I'm unable to follow. rspeer's solution was working great for me when I tried it, and I am unable to re-create the issue you're having. If everything is in Unicode literal format, then just calling decode on it should fix your issue, right? – Vedant Shetty Nov 07 '18 at 05:18