I'm trying to make sure there is no string issue in a column called Comment in my dataframe headlamp.

The reason is that I want to export the dataframe to Excel later using .to_excel(), and a Unicode error is always raised.

I have read a lot of material online, and also here, to solve this issue, but I haven't managed it so far. I tried using encode() as in the code below, but I'm still having the same issue.

headlamp = part_dataframe(ro, 'PN 3D', '921')
headlamp['Comment'] = headlamp.Comment.apply(lambda x: x.encode('ascii', 'ignore'))
headlamp['word'] = headlamp.Comment.str.split().apply(lambda x: 
    pd.value_counts(x).to_dict())
len(headlamp)

Error:

UnicodeDecodeError                        Traceback (most recent call last)
<ipython-input-57-29454fde650e> in <module>()
  1 headlamp = part_dataframe(ro, 'PN 3D', '921')
----> 2 headlamp['Comment'] = headlamp.Comment.apply(lambda x: x.encode('ascii', 'ignore'))
  3 headlamp['word'] = headlamp.Comment.str.split().apply(lambda x: 
  4 pd.value_counts(x).to_dict())
  5 len(headlamp)

C:\Users\Rafael\Anaconda2\envs\gl-env\lib\site-packages\pandas\core\series.pyc in apply(self, func, convert_dtype, args, **kwds)
2218         else:
2219             values = self.asobject
-> 2220             mapped = lib.map_infer(values, f, convert=convert_dtype)
2221 
2222         if len(mapped) and isinstance(mapped[0], Series):

pandas\src\inference.pyx in pandas.lib.map_infer (pandas\lib.c:62658)()

<ipython-input-57-29454fde650e> in <lambda>(x)
  1 headlamp = part_dataframe(ro, 'PN 3D', '921')
----> 2 headlamp['Comment'] = headlamp.Comment.apply(lambda x: x.encode('ascii', 'ignore'))
  3 headlamp['word'] = headlamp.Comment.str.split().apply(lambda x: 
  4 pd.value_counts(x).to_dict())
  5 len(headlamp)

UnicodeDecodeError: 'ascii' codec can't decode byte 0xb4 in position 71: ordinal not in range(128)

I'm completely lost on this matter, so any help will be highly appreciated.

I'm using a Jupyter (IPython) notebook.

  • Is this Python 2? Also, please show the full traceback, so we can see at which line the exception is raised. – lenz Aug 12 '17 at 22:02
  • Note that the exception is related to *de*coding, so the `encode` method itself isn't raising it. However, if this is Python 2, an implicit decoding step might be involved (automatic coercion from `str` to `unicode`). – lenz Aug 12 '17 at 22:05
  • I have updated the question with the full traceback. Regarding your second comment, could you explain it in more detail, please? My Python version is 2.7 – Rafael Rodrigues Santos Aug 12 '17 at 22:10
  • I'm not familiar with pandas' data model, but you could try `x.decode('latin-1').encode('ascii', 'ignore')` in the lambda expression. If you don't know the difference between the `str` and `unicode` types, you need to read up on the topic or switch to Python 3 (where you bump into this kind of problem much less frequently). – lenz Aug 12 '17 at 22:19
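
The implicit decoding step described in the comments above can be avoided by decoding byte strings explicitly before encoding them. The following is only a minimal sketch: the helper name `to_text` is hypothetical, and `latin-1` is an assumption taken from the suggestion above, not a confirmed property of the data.

```python
from __future__ import print_function


def to_text(value, encoding="latin-1"):
    """Return `value` as unicode text, decoding byte strings explicitly.

    `encoding` is an assumption: latin-1 maps every byte to a character,
    so decoding never fails, but the result is only correct if the data
    really is latin-1.
    """
    if isinstance(value, bytes):  # `bytes` is an alias of `str` on Python 2
        return value.decode(encoding)
    return value  # already unicode text (or a non-string such as NaN)


raw = b"it doesn\xb4t work"                   # byte string containing 0xb4
text = to_text(raw)                           # unicode text with U+00B4
ascii_only = text.encode("ascii", "ignore")   # drops the non-ASCII character

print(repr(text))
print(repr(ascii_only))
```

With the dataframe from the question, the helper would be applied as `headlamp['Comment'] = headlamp.Comment.apply(to_text)`, so that the column holds unicode text rather than byte strings. Whether that is enough for `.to_excel()` depends on the actual encoding of the data.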

1 Answer

0xb4 corresponds to the Unicode character U+00B4, the acute accent (´), which looks similar to a backtick: http://www.fileformat.info/info/unicode/char/00b4/index.htm

It looks like there's a non-ASCII character in your input. Try encoding it to UTF-8 instead and see if that helps.

If you still need it in ASCII, you could try the solution from this question: "Convert a Unicode string to a string in Python (containing extra symbols)"
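
One common way to reduce unicode text to ASCII is to decompose accented characters with the standard `unicodedata` module and then drop whatever is still non-ASCII. This is a sketch of that general technique, not necessarily what the linked question recommends, and the function name `to_ascii` is hypothetical.

```python
from __future__ import print_function

import unicodedata


def to_ascii(text):
    """Strip accents from unicode text and drop anything still non-ASCII."""
    # NFKD splits a character such as U+00E9 into "e" plus a combining accent.
    decomposed = unicodedata.normalize("NFKD", text)
    # The combining accents are non-ASCII, so "ignore" removes them.
    return decomposed.encode("ascii", "ignore").decode("ascii")


print(to_ascii(u"caf\xe9 cr\xe8me"))
```

Note that the input has to be unicode text. On Python 2, a byte string would need to be decoded with its real encoding first.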

AetherUnbound
  • Thank you for the help. It doesn't need to be in ASCII; I was just trying different encodings in order to see which one would solve it. I tried both of your suggestions and it still didn't work. Do you have any other suggestion? Thank you in advance – Rafael Rodrigues Santos Aug 12 '17 at 20:45
  • Try using the library from the other Stack Overflow link I sent and see if that gives an error! – AetherUnbound Aug 12 '17 at 20:46
  • I have tried some of the suggestions from that link and I couldn't make them work. To be honest, I don't even know if I'm doing it right. Can you help me? How would you rewrite the code from my question in order to solve it according to your suggestion? – Rafael Rodrigues Santos Aug 12 '17 at 21:03
  • `U+00B4` is the acute accent, not the backtick (which is `U+0060` and is contained in ASCII). – lenz Aug 12 '17 at 22:11
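
The identity of the two characters can be checked with the standard `unicodedata` module. A small sketch, where the variable name `names` is chosen only for illustration:

```python
from __future__ import print_function

import unicodedata

# Look up the official Unicode names of the two code points in question.
names = {}
for char in (u"\u00b4", u"\u0060"):
    names[char] = unicodedata.name(char)
    # Code points below 128 are part of ASCII.
    print("U+%04X %s in_ascii=%s" % (ord(char), names[char], ord(char) < 128))
```

The backtick is officially named GRAVE ACCENT in Unicode, which may add to the confusion between the two.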