I'm reading a CSV file as follows:

data = pd.read_csv('news.csv')

It contains news and category as columns. I need to tokenize the words in the news column. The problem is that each line of text in the news column starts with a b prefix:

b'Longevity Increase Seen Around the World: WHO'
b'Chikungunya spreading, mosquito-borne virus ...

I tried the answers to "How do I get rid of the b-prefix in a string in python?", but they are for byte-encoded strings. So,

line = data['news'][0]
line.decode('utf-8')

would cause:

AttributeError: 'str' object has no attribute 'decode'

Each of those lines is of type str. How do I remove the b prefix?

  • Hi, maybe this will be helpful: https://stackoverflow.com/questions/45923189/remove-first-character-from-pandas-column-if-the-number-1 – RAVI KUMAR Oct 16 '20 at 12:03
  • If you don't know what the original encoding was `line[2:-1]` should look ok for most letters. If you know it was utf-8 `line[2:-1].encode("latin1").decode("utf-8")` should work. – Wups Oct 16 '20 at 12:29
  • If the b prefix is already visible in the csv file itself then the file is being generated incorrectly. – snakecharmerb Oct 16 '20 at 14:32

1 Answer

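The b'...' you see can mean one of two things: the value is a bytes object that can be decoded to a string, or it is already a str whose content is the literal text b'...'.

In the first case you need line.decode(). In the second case you need line[2:-1], which drops the leading b' and the trailing quote. Since your values are of type str, the second case applies.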
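As a minimal sketch covering both cases, the helper below goes a step beyond plain slicing: it uses ast.literal_eval to parse the b'...' text back into bytes, so escape sequences such as \xc3\xa9 are decoded too. The name strip_bytes_repr and the sample strings are hypothetical, and UTF-8 is an assumption about the original encoding.

```python
import ast


def strip_bytes_repr(text, encoding="utf-8"):
    """Return the text represented by a bytes object or by the str "b'...'".

    Hypothetical helper; assumes the original data was encoded as `encoding`.
    """
    # Case 1: a real bytes object only needs decoding.
    if isinstance(text, bytes):
        return text.decode(encoding)
    # Case 2: a str that merely looks like a bytes literal.
    if isinstance(text, str) and text[:2] in ("b'", 'b"'):
        try:
            value = ast.literal_eval(text)  # safely parse the literal
        except (ValueError, SyntaxError):
            return text[2:-1]  # malformed literal: fall back to slicing
        if isinstance(value, bytes):
            return value.decode(encoding, errors="replace")
    # Anything else is returned unchanged.
    return text


# Sample values; the second one is made up to show escape handling.
samples = [
    "b'Longevity Increase Seen Around the World: WHO'",
    "b'caf\\xc3\\xa9'",
]
cleaned = [strip_bytes_repr(s) for s in samples]
```

With the DataFrame from the question, this could be applied as data['news'] = data['news'].apply(strip_bytes_repr). If the text contains no escape sequences, line[2:-1] alone gives the same result.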

frost-nzcr4