UnicodeEncodeError: 'ascii' codec can't encode character u'\ufffd' in position 3: ordinal not in range(128)

Question

I got the encoding error:

UnicodeEncodeError: 'ascii' codec can't encode character u'\ufffd' in position 3: ordinal not in range(128)

in the following Python (PySpark) code, where row is a DataFrame row:

def rowToLine(row):
  line = str(row[0]).strip()
  columnNum = 44
  for k in xrange(1, columnNum):
    line = line + "\t"
    line = line + str(row[k]).strip()  # encoding error here
  return line

I also tried the join below:

def rowToLine(row):
  s = "\t"
  return s.join(row)

but some of the values in the row are ints, so I got this error:

TypeError: sequence item 19: expected string or Unicode, int found

Does anyone know how to fix this? Thanks!

@Kevin, I looked into the question you mentioned, but it is not clear to me how to convert row[k] to a string without using str. Any suggestions? — Edamame, Jun 06 '16 at 22:48
There is not enough context to say. Do you want to output UTF-8? ISO-8859-1? Is your data even textual to begin with? — Kevin, Jun 06 '16 at 22:51
Side note: You want to be using [`str.join()`](https://docs.python.org/2/library/stdtypes.html#str.join). It has better performance than the code you're using now. — Kevin, Jun 06 '16 at 22:55
Also, return "\t".join(str(cell).strip() for cell in row[:44]) can replace the entire function. (I think. Didn't try it.) — user3757614, Jun 06 '16 at 22:56
@Kevin and user3757614: I tried join as well (see the modified question above), but got errors because some values in the row are integers. Any suggestions? "UTF-8" would work. Thanks! — Edamame, Jun 06 '16 at 23:00
@user3757614: In "\t".join(str(cell).strip() I believe the str(cell) part will cause the encode error. — Edamame, Jun 06 '16 at 23:02
Well, the real question is: what do you want it to return if non-ASCII characters are present? You can either return a unicode string (which may break if you use it for other things) or strip out the non-ASCII characters (which may also break if you use it for other things). — user3757614, Jun 06 '16 at 23:09
Don't call str on the unicode object; encode the data with `row[k].encode("utf-8")` or just use the unicode as-is. When you call str you are trying to encode to ASCII, which is going to fail for any non-ASCII character. Where is this data coming from? — Padraic Cunningham, Jun 06 '16 at 23:12
@PadraicCunningham I tried "\t".join(x.encode("utf-8") for x in row), but I got the error AttributeError: 'int' object has no attribute 'encode', because some elements in row are ints. Is there a way to fix this? — Edamame, Jun 06 '16 at 23:40
`"\t".join([ x.encode("utf-8") if isinstance(x, basestring) else x for x in row])` — Padraic Cunningham, Jun 06 '16 at 23:42
@PadraicCunningham Are the [ ] required? I mean the [ ] that enclose "x.encode("utf-8") if isinstance(x, basestring) else x for x in row". Thanks! — Edamame, Jun 06 '16 at 23:52
It is faster to use a list comprehension than a generator expression here: join builds a list regardless, so there is no advantage to using a generator at all. — Padraic Cunningham, Jun 07 '16 at 00:03
Is there a **good** reason for you not using Python 3 for this task? — Antti Haapala -- Слава Україні, Jun 07 '16 at 05:02
I would need to convince the entire team and change the production system to Python 3, which is not a small task. Python 3 is a good idea, but it would take some time to migrate. — Edamame, Jun 07 '16 at 15:47
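
The cause described in the comments above can be reproduced in isolation. This is a minimal sketch, not the asker's data: `value` is a made-up cell containing U+FFFD at position 3. Encoding it as ASCII (which is what Python 2's `str()` does implicitly for unicode objects) fails, while encoding it as UTF-8 succeeds.

```python
# Hypothetical cell value: three ASCII characters followed by U+FFFD.
value = u"abc\ufffd"

error = None
try:
    # ASCII can only represent code points below 128, so this raises.
    value.encode("ascii")
except UnicodeEncodeError as exc:
    error = exc
    print(exc)

# UTF-8 can represent every code point, so this works.
encoded = value.encode("utf-8")
print(repr(encoded))
```

The snippet uses explicit `.encode()` calls so that it behaves the same way on Python 2 and Python 3.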

score 1 · Answer 1 · answered Jun 07 '16 at 03:57

Thanks for everyone's suggestions!

I basically took Padraic Cunningham's idea and modified it to handle the int case. The code below works.

def rowToLine(row):
  s = "\t"
  # Encode unicode cells as UTF-8; str() covers ints and other non-text cells.
  return s.join([x.encode("utf-8") if isinstance(x, unicode) else str(x) for x in row])
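
For anyone who needs the same behaviour on both Python 2 and Python 3, here is a hedged sketch of one way to write it. The names `to_text`, `row_to_line`, and `sample_row` are illustrative and not from the original post, and it assumes any byte strings in the row hold UTF-8 data. It converts every cell to text first, strips it as the question's first attempt did, and encodes once at the end.

```python
import sys

# The text type is `unicode` on Python 2 and `str` on Python 3.
text_type = str if sys.version_info[0] >= 3 else unicode  # noqa: F821


def to_text(value):
    """Return value as text without triggering an implicit ASCII encode."""
    if isinstance(value, text_type):
        return value
    if isinstance(value, bytes):
        # Assumption: byte strings in the row are UTF-8 encoded.
        return value.decode("utf-8")
    # ints, floats, None, etc. are converted with the text constructor.
    return text_type(value)


def row_to_line(row, sep=u"\t"):
    """Join all cells of a row into one tab-separated UTF-8 byte string."""
    return sep.join([to_text(cell).strip() for cell in row]).encode("utf-8")


# Made-up row mixing text, an int, U+FFFD, and a float.
sample_row = (u"caf\u00e9 ", 42, u"\ufffd", 3.5)
line = row_to_line(sample_row)
print(repr(line))
```

Note that a `None` cell would come out as the text `None` here; whether that or an empty field is preferable depends on what consumes the output.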
