I got the encoding error:

UnicodeEncodeError: 'ascii' codec can't encode character u'\ufffd' in position 3: ordinal not in range(128)

in the following Python (PySpark) code, where row is a DataFrame row:

def rowToLine(row):
  line = str(row[0]).strip()
  columnNum = 44
  for k in xrange(1, columnNum):
    line = line + "\t"
    line = line + str(row[k]).strip()  # encoding error here
  return line

I also tried the join below:

def rowToLine(row):
  s = "\t"
  return s.join(row)

but some values in the row are ints, so I got this error:

TypeError: sequence item 19: expected string or Unicode, int found

Does anyone know how to fix this? Thanks!

Edamame
  • @Kevin, I looked into the question you mentioned, but it is not clear to me how to convert row[k] to a string without using str. Any suggestions? – Edamame Jun 06 '16 at 22:48
  • There is not enough context to say. Do you want to output UTF-8? ISO-8859-1? Is your data even textual to begin with? – Kevin Jun 06 '16 at 22:51
  • Side note: You want to be using [`str.join()`](https://docs.python.org/2/library/stdtypes.html#str.join). It has better performance than the code you're using now. – Kevin Jun 06 '16 at 22:55
  • Also, return "\t".join(str(cell).strip() for cell in row[:44]) can replace the entire function. (I think. Didn't try it.) – user3757614 Jun 06 '16 at 22:56
  • @Kevin and user3757614: I tried join as well (see the modified question above), but got errors because some values in the row are integers. Any suggestions? "UTF-8" would work. Thanks! – Edamame Jun 06 '16 at 23:00
  • @user3757614: "\t".join(str(cell).strip() I believe the str(cell) part will cause the encode error... – Edamame Jun 06 '16 at 23:02
  • Well, the real question is what do you want it to return if there are unicode characters present? You can either return a unicode string, (which may break if you use it for other things) or you can strip out the unicode characters. (Which may break if you use it for other things.) – user3757614 Jun 06 '16 at 23:09
  • Don't call str on the unicode; encode the data with `row[k].encode("utf-8")` or just use the unicode. When you call str you are trying to encode to ASCII, which is obviously going to error for any non-ASCII characters. Where is this data coming from? – Padraic Cunningham Jun 06 '16 at 23:12
  • @PadraicCunningham I did: "\t".join(x.encode("utf-8") for x in row); however, I got the error: AttributeError: 'int' object has no attribute 'encode', because some elements in row are ints. Is there a way to fix this? – Edamame Jun 06 '16 at 23:40
  • `"\t".join([ x.encode("utf-8") if isinstance(x, basestring) else x for x in row])` – Padraic Cunningham Jun 06 '16 at 23:42
  • @PadraicCunningham are the [ ] required? I mean the [ ] enclosing "x.encode("utf-8") if isinstance(x, basestring) else x for x in row". Thanks! – Edamame Jun 06 '16 at 23:52
  • It is faster to use a list comp than a generator expression here. A list is built regardless, so there is no advantage to using a generator at all. – Padraic Cunningham Jun 07 '16 at 00:03
  • Is there a **good** reason for you not using Python 3 for this task? – Antti Haapala -- Слава Україні Jun 07 '16 at 05:02
  • I would need to convince the entire team and change the production system to Python 3, which is not a small task. Python 3 is a good idea, but it would take some time to migrate. – Edamame Jun 07 '16 at 15:47
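
The root cause discussed in the comments above can be reproduced in isolation. This is only an illustrative sketch: the sample string and the helper name `ascii_encode_error` are made up for the demonstration. In Python 2, calling str on a unicode object performs an implicit ASCII encode; the explicit `.encode("ascii")` below triggers the same codec error, while an explicit UTF-8 encode succeeds.

```python
# -*- coding: utf-8 -*-
# Hypothetical sample value: U+FFFD sits at position 3, as in the traceback.
text = u"abc\ufffd"


def ascii_encode_error(value):
    """Return the UnicodeEncodeError raised by an ASCII encode, or None."""
    try:
        # Python 2's str(unicode_value) does this encode implicitly.
        value.encode("ascii")
    except UnicodeEncodeError as exc:
        return exc
    return None


err = ascii_encode_error(text)     # the ASCII codec rejects U+FFFD
utf8_bytes = text.encode("utf-8")  # an explicit UTF-8 encode works
```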

1 Answer

Thanks for everyone's suggestions!

I basically took Padraic Cunningham's idea and modified it to handle the int case. The code below works.

def rowToLine(row):
  s = "\t"
  return s.join( x.encode("utf-8") if isinstance(x, basestring) else str(x).encode("utf-8") for x in row)
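
A possible variant, as a sketch only and not what is running in production: build the whole line as text first and encode to UTF-8 once at the end, so no cell ever goes through the implicit ASCII codec. The names `to_text`, `row_to_line` and `text_type` are made up for this example, and the `.strip()` per cell is carried over from the first version in the question. It is written so that it should run on both Python 2 and Python 3.

```python
# -*- coding: utf-8 -*-
from __future__ import unicode_literals

import sys

PY2 = sys.version_info[0] == 2
# `unicode` only exists on Python 2; it is not evaluated on Python 3.
text_type = unicode if PY2 else str


def to_text(value, encoding="utf-8"):
    """Return value as a text (unicode) string, never via the ASCII codec."""
    if isinstance(value, text_type):
        return value
    if isinstance(value, bytes):
        # Byte strings are decoded explicitly; bad bytes become U+FFFD.
        return value.decode(encoding, "replace")
    # ints, floats, None, etc. convert to text without any encoding step.
    return text_type(value)


def row_to_line(row, sep="\t", encoding="utf-8"):
    """Join the cells of a row into one tab-separated, UTF-8 encoded line."""
    return sep.join(to_text(cell).strip() for cell in row).encode(encoding)
```

For example, `row_to_line(["a\ufffdb", 19, " x "])` returns a byte string with three tab-separated fields, with the int rendered as `19` and the padding around `x` stripped.
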
Edamame