0

So I'm trying to convert a cell's value into a usable string.

What I'm trying to do is use the cell value in regex, but it keeps throwing the error

UnicodeEncodeError: 'ascii' codec can't encode character u'\u2019' in position 26: ordinal not in range(128)

This is just one of many problems, as when I do convert it from a unicode to an ascii value, another cell gives me datetime error (as it is in datetime).

Any advice on how to convert this to a string so that it can be used in regex, since these values are printable.

stovfl
  • 14,998
  • 7
  • 24
  • 51
A. L
  • 11,695
  • 23
  • 85
  • 163

2 Answers2

0

Without looking at specific strings, I would say that before loading data to an xls(x) object it needs to be converted to utf-8 format.

cell_value = cell.decode("utf-8")

For datetime error: You will need to convert string containing datetime to proper datetime format.

There are various posts on SO for that.
Here is a similar question on SO for string to datetime conversion

ascripter
  • 5,665
  • 12
  • 45
  • 68
Anil_M
  • 10,893
  • 6
  • 47
  • 74
  • Unfortunately that didn't work. I got around it by using `unicodedata.normalize('NFKD', cell.value).encode('ascii','ignore')` and as for datetime I just used `str(cell.value)`. Just seems weird that it can `print` the values without issue but can't convert them to normal strings. – A. L Mar 16 '17 at 05:21
0

I don't see the point, why you have to convert from utf-8.

From the unicode docs:
UTF-8 uses the following rules:

If the code point is < 128, it’s represented by the corresponding byte value.
If the code point is >= 128, it’s turned into a sequence of two, three, or four bytes, where each byte of the sequence is between 128 and 255.  

You can convert it to ascii, for instance:

u.encode('utf-8') = b"\xea\x80\x80abcd\xde\xb4 u'\\u2019'=\xe2\x80\x99"
u.encode('ascii', 'ignore') = b"abcd u'\\u2019'="
u.encode('ascii', 'replace') = b"?abcd? u'\\u2019'=?"
u.encode('ascii', 'xmlcharrefreplace') = b"&#40960;abcd&#1972; u'\\u2019'=&#8217;"
u.encode('ascii', 'backslashreplace') = b"\\ua000abcd\\u07b4 u'\\u2019'=\\u2019"  

From the re docs:
Both patterns and strings to be searched can be Unicode strings as well as 8-bit strings. However, Unicode strings and 8-bit strings cannot be mixed: that is, you cannot match a Unicode string with a byte pattern or vice-versa; similarly, when asking for a substitution, the replacement string must be of the same type as both the pattern and the search string.

re.A
re.ASCII

Make \w, \W, \b, \B, \d, \D, \s and \S perform ASCII-only matching instead of full Unicode matching.
This is only meaningful for Unicode patterns, and is ignored for byte patterns.

Note that for backward compatibility, the re.U flag still exists 
(as well as its synonym re.UNICODE and its embedded counterpart (?u)), 
but these are redundant in Python 3 since matches are Unicode by default for strings 
(and Unicode matching isn’t allowed for bytes).

Tested with Python:3.4.2

stovfl
  • 14,998
  • 7
  • 24
  • 51
  • 2
    I will have to try in a few days. I used `unicodedata.normalize('NFKD', cell.value).encode('ascii','ignore')` and it works – A. L Mar 18 '17 at 01:08