How to remove b from the beginning of each line of string read from file?

Question

I'm reading a csv as follows.

data = pd.read_csv('news.csv')

It has two columns, news and category. I need to tokenize the words in the news column. The problem is that each line of text in the news column starts with a b prefix:

b'Longevity Increase Seen Around the World: WHO'
b'Chikungunya spreading, mosquito-borne virus ...

I tried the answers to "How do I get rid of the b-prefix in a string in python?", but those apply to bytes objects. So,

line = data['news'][0]
line.decode('utf-8')

would cause:

AttributeError: 'str' object has no attribute 'decode'

Each of those lines is of type str. How do I remove the b prefix?

Hi, maybe this will be helpful: https://stackoverflow.com/questions/45923189/remove-first-character-from-pandas-column-if-the-number-1 — RAVI KUMAR, Oct 16 '20 at 12:03
If you don't know what the original encoding was `line[2:-1]` should look ok for most letters. If you know it was utf-8 `line[2:-1].encode("latin1").decode("utf-8")` should work. — Wups, Oct 16 '20 at 12:29
If the b prefix is already visible in the csv file itself then the file is being generated incorrectly. — snakecharmerb, Oct 16 '20 at 14:32
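To illustrate the last comment, here is a minimal sketch of how the prefix can end up inside the file. It assumes the file was produced with Python's csv module from bytes values, which the question does not confirm; the headline variable and the in-memory buffers are only for illustration.

```python
import csv
import io

# Assumption: the producer held the text as a bytes object.
headline = "Longevity Increase Seen Around the World: WHO".encode("utf-8")

# Writing the bytes object directly stores its printed form, prefix included.
wrong = io.StringIO()
csv.writer(wrong).writerow([headline])

# Decoding to str before writing stores the plain text.
right = io.StringIO()
csv.writer(right).writerow([headline.decode("utf-8")])

print(wrong.getvalue().strip())
print(right.getvalue().strip())
```

If the producer can be changed, decoding before writing removes the need for any cleanup on the reading side.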

score 1 · Answer 1 · answered Oct 16 '20 at 12:13

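The b'...' you see may be a bytes object, which can be decoded to a string, or it may be a string whose content is literally b'...'.

In the first case you need line.decode(); in the second case you need line[2:-1]. The AttributeError in the question shows the values are already str, so the second case applies here.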
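For the second case, a possible alternative to plain slicing is to parse the text back into a bytes object with ast.literal_eval and then decode it, which also turns escapes such as \xc3\xa9 back into real characters. This is only a sketch: the helper name strip_bytes_repr is made up here, and UTF-8 is an assumed original encoding.

```python
import ast


def strip_bytes_repr(text, encoding="utf-8"):
    """Turn a string like "b'abc'" into "abc".

    Values that are not the printed form of a bytes object
    (including non-strings such as NaN) are returned unchanged.
    """
    if not (isinstance(text, str) and text[:2] in ("b'", 'b"')):
        return text
    try:
        value = ast.literal_eval(text)  # "b'abc'" -> b'abc'
    except (ValueError, SyntaxError):
        # Not a complete bytes literal (for example, truncated text).
        return text
    # `encoding` is an assumption about how the bytes were produced.
    return value.decode(encoding) if isinstance(value, bytes) else text


lines = [
    "b'Longevity Increase Seen Around the World: WHO'",
    "b'caf\\xc3\\xa9'",  # escaped UTF-8 bytes for "café"
    "already clean",
]
cleaned = [strip_bytes_repr(line) for line in lines]
print(cleaned)
```

With the DataFrame from the question, the same helper could be applied to the whole column with data['news'] = data['news'].apply(strip_bytes_repr).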

frost-nzcr4
