15

I used tweepy to collect tweets and stored their text in a CSV file with Python's csv.writer(). I had to encode the text as UTF-8 before writing it; otherwise tweepy threw an error.

Now, the text data is stored like this:

"b'Lorem Ipsum\xc2\xa0Assignment '"

I tried to decode this with the code below (the file has other columns too; the text is in the fourth column, index 3):

with open('data.csv','rt',encoding='utf-8') as f:
    reader = csv.reader(f,delimiter=',')
    for row in reader:
        print(row[3])

But this doesn't decode the text. I can't call .decode('utf-8') because the csv reader returns strings (type(row[3]) is str), and I can't seem to convert the value back into bytes: when I try, the data gets encoded once more!

How can I decode the text data?

Edit: Here's a sample line from the csv file:

67783591545656656999,3415844,1450443669.0,b'Virginia School District Closes After Backlash Over Arabic Assignment: The Augusta County school district in\xe2\x80\xa6  | @abcde',52,18

Note: If the solution is in the encoding process, please note that I cannot afford to download the entire data again.

martineau
gitmorty
  • Please show at least one complete line from your file exactly as it appears if you open that file in a text editor. We can't reproduce your problem without both your code and your data. – BoarGules Dec 10 '17 at 17:23
  • I'm sorry. I have added an example. Save that line in a file with .csv extension. – gitmorty Dec 10 '17 at 17:49
  • So the csv file literally has strings in it which are represented with a `b` prefix on them, like the `b'Virginia...'`? – martineau Dec 10 '17 at 18:08
  • Whatever produced that CSV is broken and should be repaired. – tripleee Dec 10 '17 at 18:10
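
For context, a file like this is what Python 3's csv.writer produces when a row contains bytes rather than str: non-string values are converted with str(), which for bytes yields the b'...' repr. The sketch below reproduces that, assuming the tweet text was passed through .encode('utf-8') before being written. The sample text and the in-memory buffer are illustrative, not the asker's actual code.

```python
import csv
import io

# Hypothetical tweet text, used only for illustration.
text = "Lorem Ipsum\u00a0Assignment"

buffer = io.StringIO()
writer = csv.writer(buffer)

# Passing bytes: csv.writer stringifies non-string values with str(),
# so the repr of the bytes object (b'...') ends up in the file.
writer.writerow([text.encode("utf-8")])

# Passing str: the text itself is written.
writer.writerow([text])

lines = buffer.getvalue().splitlines()
print(lines[0])  # the b'...' form seen in the question
print(lines[1])  # the plain text
```

If the file was indeed written this way, passing the str directly (with the output file opened using encoding='utf-8') should avoid the problem for future downloads. Files that already exist still need a repair step on read.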

3 Answers

15

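The simplest approach, if you have the raw bytes, is to decode them to a string and wrap the result in a StringIO so that csv.reader can consume it: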

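import csv
from io import StringIO

# Example bytes; in practice this would be the raw CSV content.
byte_content = b"iam byte content"
# Decode to str (UTF-8 by default) so the csv module can parse it.
content = byte_content.decode()
# csv.reader expects an iterable of text lines, which StringIO provides.
file = StringIO(content)
csv_data = csv.reader(file, delimiter=",")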
7

If your input file really contains strings with Python-syntax b prefixes on them, one way to work around it (even though that's not really a valid format for CSV data) is to use Python's ast.literal_eval() function as @Ry suggested, although I would use it in a slightly different manner, as shown below.

This will provide a safe way to parse strings in the file which are prefixed with a b indicating they are byte-strings. The rest will be passed through unchanged.

Note that this doesn't require reading the entire CSV file into memory.

import ast
import csv


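def _parse_bytes(field):
    """Convert string represented in Python byte-string literal b'' syntax into
    a decoded character string - otherwise return it unchanged.
    """
    try:
        result = ast.literal_eval(field)
    except (ValueError, TypeError, SyntaxError, MemoryError, RecursionError):
        return field  # Not a Python literal, e.g. a header or plain text.
    # Decode only bytes literals; numbers and other literals keep their text.
    return result.decode() if isinstance(result, bytes) else field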


def my_csv_reader(filename, /, **kwargs):
    with open(filename, 'r', newline='') as file:
        for row in csv.reader(file, **kwargs):
            yield [_parse_bytes(field) for field in row]


reader = my_csv_reader('bytes_data.csv', delimiter=',')
for row in reader:
    print(row)
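
As a quick check of this approach, here is a minimal self-contained sketch that runs the same idea over the sample row from the question, held in memory rather than read from a file. The header row and the name decode_field are assumptions made up for the illustration; the data row is the one posted in the question.

```python
import ast
import csv
import io

# Hypothetical header row plus the sample data row from the question.
sample = (
    "id,user_id,created,text,retweets,favorites\n"
    "67783591545656656999,3415844,1450443669.0,"
    "b'Virginia School District Closes After Backlash Over Arabic Assignment: "
    "The Augusta County school district in\\xe2\\x80\\xa6  | @abcde',52,18\n"
)


def decode_field(field):
    """Decode a b'...' literal to str; return any other field unchanged."""
    try:
        value = ast.literal_eval(field)
    except (ValueError, SyntaxError):
        return field  # headers and ordinary text are not Python literals
    return value.decode("utf-8") if isinstance(value, bytes) else field


rows = [
    [decode_field(field) for field in row]
    for row in csv.reader(io.StringIO(sample))
]

print(rows[0])     # header row, untouched
print(rows[1][3])  # tweet text with the escaped bytes decoded
```

Only fields that evaluate to bytes are changed, so header strings and numeric columns come back exactly as they appear in the file.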
martineau
  • Thank you. This does solve the case above, but I don't feel comfortable using eval(). It even fails on my file, as it has header strings. – gitmorty Dec 10 '17 at 19:39
  • gitmorty: I think @Ryan's idea of using `ast.literal_eval()` instead of `eval()` is a good one and have incorporated the basic idea into my own answer—which I think addresses both the issues you mentioned in your comment. – martineau Dec 10 '17 at 20:40
2

You can use ast.literal_eval to convert the incorrect fields back to bytes safely:

import ast


def _parse_bytes(bytes_repr):
    result = ast.literal_eval(bytes_repr)

    if not isinstance(result, bytes):
        raise ValueError("Malformed bytes repr")

    return result
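
The function above returns bytes, so one more .decode() call is needed to get text. A minimal sketch of that step follows, using the example value from the question. The function is repeated under the illustrative name parse_bytes_repr so the snippet runs on its own, and UTF-8 is assumed because that is how the question says the text was encoded.

```python
import ast


def parse_bytes_repr(bytes_repr):
    """Same logic as above: evaluate the literal and insist on bytes."""
    result = ast.literal_eval(bytes_repr)

    if not isinstance(result, bytes):
        raise ValueError("Malformed bytes repr")

    return result


# The field exactly as csv.reader would return it (a str, not bytes).
field = "b'Lorem Ipsum\\xc2\\xa0Assignment '"
text = parse_bytes_repr(field).decode("utf-8")
print(repr(text))

# Fields that are valid literals but not bytes are rejected.
try:
    parse_bytes_repr("52")
except ValueError as exc:
    error = str(exc)
    print(error)
```

Because this version raises on anything that is not a bytes literal, it suits being applied only to the text column rather than to every field in a row.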
Ry-