I am working with a CSV dataset in which the column text
contains strings written as if they were bytes. An example:
df.text[1]
returns:
"b'La monarqu\xc3\xada ....(other text encoded in utf-8) ...' "
I want the text decoded from bytes to a Unicode string, so I tried this:
df.text.str.decode('utf-8')
But this returns a Series of NaN:
>>> df['text'].str.decode(encoding = 'utf-8')
0 NaN
1 NaN
2 NaN
3 NaN
4 NaN
...
Name: text, Length: 21120, dtype: float64
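A possible explanation for the NaN values, sketched under the assumption that each cell is an ordinary str holding the printed form of a bytes object (not a bytes object itself): in Python 3 a str has no decode method, and .str.decode is meant for bytes values. The variable names below are only for illustration.

```python
# A real bytes object versus a str that merely looks like one.
raw = "La monarquía".encode("utf-8")   # bytes
as_text = repr(raw)                    # str containing the characters b'...'

print(type(raw).__name__)              # bytes
print(type(as_text).__name__)          # str
print(as_text)                         # b'La monarqu\xc3\xada'

# bytes can be decoded; the str look-alike has no decode method at all.
print(raw.decode("utf-8"))
print(hasattr(as_text, "decode"))      # False
```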
I also tried this:
df.text.astype('string')
or
df.text.astype('bytes')
Neither works. Does anybody have an idea how to solve this? Thanks in advance.
Edit
A snippet of the CSV file:
int_column1,a_polynomial,text,a_boolean,a_date,int_column2
1,class1,"b'La #monarqu\xc3\xada ... some text ...'",False, Mon Apr 03 09:45:51 +0000 2021,0
How I imported the data into Python:
import pandas as pd
data = pd.read_csv(r'PATH', parse_dates=['a_date'])
df = pd.DataFrame(data)  # redundant: read_csv already returns a DataFrame
The problem is that df.text[1]
returns a string, while df.text
appears to be a Series of bytes that I cannot decode because of the problems shown above.
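A minimal sketch of one way this might be approached, assuming every affected cell is a str containing a Python bytes literal such as "b'...'" (as in the CSV snippet above). The helper name decode_bytes_literal is hypothetical; ast.literal_eval is from the standard library and only evaluates literals, so it does not execute arbitrary code from the file.

```python
import ast

def decode_bytes_literal(cell, encoding="utf-8"):
    """Convert text like "b'...'" into the str that those bytes encode.

    Hypothetical helper: assumes the cell holds a Python bytes literal.
    """
    if not isinstance(cell, str):
        return cell                      # leave NaN / non-string cells untouched
    stripped = cell.strip()
    if not stripped.startswith(("b'", 'b"')):
        return cell                      # not a bytes literal; return unchanged
    value = ast.literal_eval(stripped)   # parse the literal into real bytes
    return value.decode(encoding)        # decode the bytes into a str

# Raw string, so the backslashes are literal characters as in the CSV file.
sample = r"b'La monarqu\xc3\xada'"
print(decode_bytes_literal(sample))      # La monarquía
```

On the DataFrame above, the helper could presumably be applied element-wise with df['text'].map(decode_bytes_literal). Cells that do not start with b' or b" are returned unchanged, and a malformed literal would raise an error from ast.literal_eval.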