
I am working with a CSV dataset where the text column contains strings written as if they were bytes. An example:

df.text[1]

returns:

"b'La monarqu\xc3\xada ....(other text encoded in utf-8) ...' "

Obviously I want my text decoded from bytes to a Unicode string, so I tried this:

df.text.str.decode('utf-8')

But this solution returns a series of NaN:

>>> df['text'].str.decode(encoding = 'utf-8')
0      NaN
1      NaN
2      NaN
3      NaN
4      NaN
...
Name: text, Length: 21120, dtype: float64

I tried this:

df.text.astype('string')

or

df.text.astype('bytes')

Nothing works. Does anybody have any idea how to solve this? Thanks in advance.

Edit

A snippet of the CSV file:

int_column1,a_polynomial,text,a_boolean,a_date,int_column2
1,class1,"b'La #monarqu\xc3\xada ... some text ...'",False, Mon Apr 03 09:45:51 +0000 2021,0

How I imported the data into Python:

import pandas as pd
data = pd.read_csv(r'PATH', parse_dates=['a_date'])
df = pd.DataFrame(data)

The problem is that df.text[1] returns a string, while df.text returns a series of bytes that I can't decode because of the problems shown above.

Giorno
  • I don't see any errors thrown at me, is it a question? – Giorno Apr 29 '21 at 08:59
  • What exactly does `df.text[1]` return? Is it `b'La..'` or `"b'La..'"` or `"b'La..' "`? You claim you did `df.text`, but the code you show says `>>> df['text']`. `bytes.decode(b'La monarqu\xc3\xada ....(other text encoded in utf-8) ...')` works fine. However, when supplying a string, not bytes, it doesn't work. Provide the full error traceback. Also provide a minimal example of how you create `df`. – Tin Nguyen Apr 29 '21 at 09:00
  • This is exactly the problem. The code returns exactly what I specified in the original question. By the way, thanks for the hint, I'm updating the question – Giorno Apr 29 '21 at 09:03
  • @Giorno nevermind, I was guessing that `pandas` was suppressing an error and simply returning `NaN`. By default, I just checked and the error will be raised. In any case, you really need to provide a [mcve]. – juanpa.arrivillaga Apr 29 '21 at 09:03
  • For example, can you provide the output of `df['text'].head().to_dict()`? – juanpa.arrivillaga Apr 29 '21 at 09:04
  • The problem description looks like you have _a string containing_ the Python `repr` of a byte string, not just the byte string itself. This is a fairly common problem (for reasons I can only vaguely imagine) but probably relates to how you read the CSV, or perhaps even with how the CSV was generated in the first place, and should be fixed there instead. – tripleee Apr 29 '21 at 09:06
  • @triplee yes, this is the exact problem. I don't see any parameters in 'pandas.read_csv()' that might help. I tried to force column dtypes, but with no success. Unfortunately I have no control over the CSV generation. By the way, I'm updating the question with all the suggestions – Giorno Apr 29 '21 at 09:10
  • I updated the question and I'm working on a minimal reproducible example – Giorno Apr 29 '21 at 09:22
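
The diagnosis in the comments (each cell is a string holding the Python `repr` of a bytes object, not bytes) suggests one possible workaround when the CSV itself cannot be fixed: parse the literal back into real bytes, then decode it. The sketch below is only an illustration under that assumption. `decode_bytes_repr` is a hypothetical helper name, not a pandas function, and the sample values are made up to mirror the snippet in the question.

```python
import ast


def decode_bytes_repr(value, encoding="utf-8"):
    """Turn the text "b'...'" into a decoded str; pass anything else through.

    Hypothetical helper: assumes the cell holds the repr of a bytes object.
    """
    if not isinstance(value, str):
        return value  # e.g. NaN for a missing cell
    stripped = value.strip()
    if not stripped.startswith(("b'", 'b"')):
        return value  # ordinary text, nothing to undo
    try:
        # literal_eval only parses Python literals; it does not run code.
        parsed = ast.literal_eval(stripped)
    except (ValueError, SyntaxError):
        return value  # looked like a bytes literal but was malformed
    if isinstance(parsed, bytes):
        return parsed.decode(encoding)
    return value


# The raw strings contain literal backslashes, as the CSV text would.
samples = [r"b'La monarqu\xc3\xada'", r"b'La #monarqu\xc3\xada' ", "plain text"]
decoded = [decode_bytes_repr(s) for s in samples]
print(decoded)
```

With a DataFrame like the one in the question, the same helper could be applied with `df['text'].map(decode_bytes_repr)`, or passed to `pd.read_csv` through its `converters` argument, e.g. `converters={'text': decode_bytes_repr}`. Bytes that are not valid UTF-8 would still raise `UnicodeDecodeError` here, which is left unhandled on purpose so that bad rows are noticed.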
