
I'm trying to filter the columns of a DataFrame with a unicode regex. The code needs to be compatible with both Python 2 and Python 3.

df.filter(regex=u'证券代码')

The code throws an error in Python 2:

  File "D:\Applications\Anaconda2\lib\site-packages\pandas\core\generic.py", line 2469, in filter
    axis=axis_name)
  File "D:\Applications\Anaconda2\lib\site-packages\pandas\core\generic.py", line 1838, in select
    np.asarray([bool(crit(label)) for label in axis_values])]
  File "D:\Applications\Anaconda2\lib\site-packages\pandas\core\generic.py", line 2468, in <lambda>
    return self.select(lambda x: matcher.search(str(x)) is not None,
UnicodeEncodeError: 'ascii' codec can't encode characters in position 0-3: ordinal not in range(128)

So I wrote a unit test:

class StrTest(unittest.TestCase):
    def test_str(self):
        str(u'证券代码')

It reports the same error.

Any idea about this error? How do I filter DataFrame with a unicode regex?

user1633272
  • This question is related to your problem: https://stackoverflow.com/questions/9942594/unicodeencodeerror-ascii-codec-cant-encode-character-u-xa0-in-position-20 – Craig Apr 16 '17 at 13:43
  • This open bug report for pandas looks like it describes your problem: https://github.com/pandas-dev/pandas/issues/13101 – Craig Apr 16 '17 at 13:53
  • It seems that I could use sys.setdefaultencoding("utf-8") to resolve the problem, but the advice is to avoid it - http://stackoverflow.com/questions/3828723/why-should-we-not-use-sys-setdefaultencodingutf-8-in-a-py-script – user1633272 Apr 16 '17 at 13:55
  • I tried setting the environment variable PYTHONIOENCODING=UTF-8, but it doesn't work. – user1633272 Apr 16 '17 at 14:00
  • I can't reproduce. I'm using pandas 0.19.2 and Python 3.6 and both the `filter()` and the `str()` commands work fine. What versions are you using? – Craig Apr 16 '17 at 14:03
  • It happens on Windows; I didn't try Linux or Mac. – user1633272 Apr 16 '17 at 14:04
  • I'm on Windows 10. – Craig Apr 16 '17 at 14:05
  • I'm on Windows 10 too; it works in neither cmd nor PyCharm. – user1633272 Apr 16 '17 at 14:08
  • The following code works fine for me. Do you get the same error with this? `import pandas as pd; df = pd.DataFrame( {'ascii':range(10), u'证券代码':range(10,20)}); print(df.filter(regex=u'证券代码'))` – Craig Apr 16 '17 at 14:15
  • Yes, still same error. – user1633272 Apr 16 '17 at 14:17

1 Answer

I can only reproduce this problem in Python 2.7. For a Python 2.7 environment, there are several work-arounds:

This is the DataFrame that I'm using:

# -*- coding: utf-8 -*-
import pandas as pd

df = pd.DataFrame({'ascii': range(10), u'证券代码': range(10, 20)})

1) Slice Notation

Use a regex to directly filter the list of column names and then use standard indexing to select those columns:

import re
matches = [c for c in df.columns if re.search(u'证券代码', c)]
print(df[matches])

Another way to get the matching columns is to use Python's built-in filter function:

colpattern = re.compile(u'证券代码')
matches = list(filter(colpattern.search, df.columns))
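The two snippets above can be wrapped in a small helper that runs unchanged on Python 2 and Python 3. This is only a sketch: the name labels_matching is hypothetical, and it assumes the labels are unicode text (or plain ASCII/numeric values), not non-ASCII byte strings. It works on any iterable of labels, so a plain list stands in for df.columns here.

```python
# -*- coding: utf-8 -*-
import re


def labels_matching(labels, pattern):
    """Return the labels whose text form matches the regex `pattern`.

    Hypothetical helper, not part of pandas.
    """
    matcher = re.compile(pattern, re.UNICODE)
    # u'%s' % (label,) produces unicode text on Python 2 and str on Python 3,
    # so unicode labels are never pushed through the ASCII codec the way
    # str(label) does on Python 2.
    return [label for label in labels if matcher.search(u'%s' % (label,))]


# A plain list stands in for df.columns; 42 shows a non-string label.
columns = [u'ascii', u'证券代码', u'证券名称', 42]
matches = labels_matching(columns, u'证券')
```

With the DataFrame from above, the call would be df[labels_matching(df.columns, u'证券代码')].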

2) DataFrame.select()

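Pass a matching function to .select(). The function is called with each column name, so it can apply a regex or any other test. Note that .select() was deprecated in pandas 0.21 and removed in 1.0, so this option only applies to older versions.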

import re
print(df.select(lambda c: re.search(u'证券代码',c), axis=1))

NOTE: For a regex as simple as this, you could use u'证券代码' in c as the criterion and not load the regex library at all.
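A minimal sketch of that substring test, assuming every column label is a unicode string (the in operator would raise a TypeError on a numeric label). A plain list stands in for df.columns:

```python
# -*- coding: utf-8 -*-
# A plain list stands in for df.columns.
columns = [u'ascii', u'证券代码']

# Plain substring containment; no regex and no str() conversion involved.
matches = [c for c in columns if u'证券代码' in c]
```

The result can then be used for indexing in the same way as before, i.e. df[matches].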

Craig