Properly encoding sc.textFile data (Python 2.7)

Question

My CSV was originally created by Excel. Anticipating encoding anomalies, I opened and re-saved the file with UTF-8 BOM encoding using Sublime Text.

Imported into the notebook:

filepath = "file:///Volumes/PASSPORT/Inserts/IMAGETRAC/csv/universe_wcsv.csv"
uverse = sc.textFile(filepath)
header = uverse.first()
data = uverse.filter(lambda x: x != header)

Formatted my fields:

fields = header.replace(" ", "_").replace("/", "_").split(",")

Structured the data:

import csv
from StringIO import StringIO
from collections import namedtuple

Products = namedtuple("Products", fields, verbose=True)

def parse(row):
    reader = csv.reader(StringIO(row))
    row = reader.next()
    return Products(*row)

products = data.map(parse)

If I then do products.first(), I get the first record as expected. However, if I want to, say, see the count by brand and so run:

products.map(lambda x: x.brand).countByValue()

I still get a Py4JJavaError caused by a UnicodeEncodeError:

File "<ipython-input-18-4cc0cb8c6fe7>", line 3, in parse
UnicodeEncodeError: 'ascii' codec can't encode character u'\xab' in position 125: ordinal not in range(128)

How can I fix this code?

You can check the answers to this question: https://stackoverflow.com/questions/904041/reading-a-utf8-csv-file-with-python - C_codio, Jun 04 '19 at 07:06

score 1 · Accepted Answer · edited May 23 '17 at 12:17

The csv module in legacy Python (Python 2) doesn't support Unicode input. Personally, I would recommend using the Spark CSV data source instead:

df = spark.read.option("header", "true").csv(filepath)
fields = [c.strip().replace(" ", "_").replace("/", "_") for c in df.columns]
df.toDF(*fields).rdd

For most applications, Row objects should work as well as a namedtuple (Row extends tuple and provides similar attribute getters), but you can easily convert one into the other.
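If a namedtuple is still wanted, the conversion can be a one-liner. This is only a minimal sketch: the two column names and the sample values are made up for illustration, and a plain tuple stands in for a Spark Row so the snippet runs without a cluster. It relies only on the fact, mentioned above, that Row extends tuple, so positional unpacking preserves column order.

```python
from collections import namedtuple

# Hypothetical column names; in the notebook these would be the cleaned
# `fields` list built from df.columns.
fields = ["id", "brand"]
Products = namedtuple("Products", fields)

def row_to_product(row):
    """Convert a Row (or any tuple in column order) into a Products record."""
    return Products(*row)

# A plain tuple stands in for a pyspark.sql.Row here.
sample = ("1", u"\xabAcme\xbb")
product = row_to_product(sample)
```

In the notebook, the same function could presumably be applied with something like df.toDF(*fields).rdd.map(row_to_product).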

You could also try reading the data without decoding:

uverse = sc.textFile(filepath, use_unicode=False)

and decoding fields manually after initial parsing:

(data
    .map(parse)
    .map(lambda prod: Products(*[x.decode("utf-8") for x in prod])))
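The parse-then-decode idea can be tried outside Spark on a single line. The sketch below is an illustration under assumptions, not the asker's actual code: `parse_utf8_line`, the two-field `Product` tuple, and the sample line are all hypothetical. In Python 2 the csv module is given the raw UTF-8 byte string and each field is decoded afterwards, which avoids the implicit ASCII encoding behind the UnicodeEncodeError. The Python 3 branch exists only so the snippet runs on either interpreter.

```python
import csv
import sys
from collections import namedtuple

PY2 = sys.version_info[0] == 2

# Hypothetical two-column layout; the real fields come from the CSV header.
Product = namedtuple("Product", ["id", "brand"])

def parse_utf8_line(raw):
    """Parse one UTF-8 encoded CSV line (a byte string) into a Product."""
    if PY2:
        # Python 2: csv works on byte strings, so parse first, decode after.
        row = [x.decode("utf-8") for x in next(csv.reader([raw]))]
    else:
        # Python 3: csv works on text, so decode first, parse after.
        row = next(csv.reader([raw.decode("utf-8")]))
    return Product(*row)

# Sample line containing u'\xab', the character named in the traceback.
line = u'17,"\xabAcme\xbb, Inc."'.encode("utf-8")
product = parse_utf8_line(line)
```

Parsing the undecoded bytes is safe for UTF-8 because the bytes of a multi-byte character never collide with the ASCII comma or quote characters the parser looks for.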

Related question: Reading a UTF8 CSV file with Python
