How to use DataFrame.filter with regex containing unicode in Python?

Question

I'm trying to filter the columns of a DataFrame with a unicode regex. The code needs to be compatible with both Python 2 and Python 3.

df.filter(regex=u'证券代码')

The code throws an error in Python 2:

  File "D:\Applications\Anaconda2\lib\site-packages\pandas\core\generic.py", line 2469, in filter
    axis=axis_name)
  File "D:\Applications\Anaconda2\lib\site-packages\pandas\core\generic.py", line 1838, in select
    np.asarray([bool(crit(label)) for label in axis_values])]
  File "D:\Applications\Anaconda2\lib\site-packages\pandas\core\generic.py", line 2468, in <lambda>
    return self.select(lambda x: matcher.search(str(x)) is not None,
UnicodeEncodeError: 'ascii' codec can't encode characters in position 0-3: ordinal not in range(128)

So I wrote a unit test:

# -*- coding: utf-8 -*-
import unittest

class StrTest(unittest.TestCase):
    def test_str(self):
        str(u'证券代码')  # raises UnicodeEncodeError on Python 2

It reports the same error.

Any idea what causes this error? How do I filter a DataFrame with a unicode regex?

This question is related to your problem: https://stackoverflow.com/questions/9942594/unicodeencodeerror-ascii-codec-cant-encode-character-u-xa0-in-position-20 — Craig, Apr 16 '17 at 13:43
This open bug report for pandas looks like it describes your problem: https://github.com/pandas-dev/pandas/issues/13101 — Craig, Apr 16 '17 at 13:53
It seems that I could use sys.setdefaultencoding("utf-8") to resolve the problem, but the advice is to avoid this: http://stackoverflow.com/questions/3828723/why-should-we-not-use-sys-setdefaultencodingutf-8-in-a-py-script — user1633272, Apr 16 '17 at 13:55
I tried setting the environment variable PYTHONIOENCODING=UTF-8, but it doesn't work. — user1633272, Apr 16 '17 at 14:00
I can't reproduce. I'm using pandas 0.19.2 and Python 3.6 and both the `filter()` and the `str()` commands work fine. What versions are you using? — Craig, Apr 16 '17 at 14:03
The following code works fine for me. Do you get the same error with this? `import pandas as pd; df = pd.DataFrame( {'ascii':range(10), u'证券代码':range(10,20)}); print(df.filter(regex=u'证券代码'))` — Craig, Apr 16 '17 at 14:15

Craig · Accepted Answer · 2017-04-16T15:49:40.337

I can only reproduce this problem in Python 2.7. For a Python 2.7 environment, there are several workarounds.
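On the cause: the traceback in the question ends in str(x), and in Python 2 calling str() on a unicode object implicitly encodes it with the ASCII codec, which fails for non-ASCII characters. The sketch below performs that encoding step explicitly, so it behaves the same on Python 2 and Python 3. The helper name is_ascii_encodable is hypothetical and only for illustration.

```python
# -*- coding: utf-8 -*-

def is_ascii_encodable(text):
    """Return True if `text` can be encoded as ASCII, False otherwise."""
    try:
        # This is the step Python 2 performs implicitly inside str(u'...').
        text.encode('ascii')
        return True
    except UnicodeEncodeError:
        return False

print(is_ascii_encodable(u'ascii'))      # plain ASCII label
print(is_ascii_encodable(u'证券代码'))    # label from the question
```

Any label for which this returns False would trigger the UnicodeEncodeError when passed through str() on Python 2.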

This is the DataFrame that I'm using:

# -*- coding: utf-8 -*-
import pandas as pd

df = pd.DataFrame({'ascii': range(10), u'证券代码': range(10, 20)})

1) Slice Notation

Use a regex to directly filter the list of column names and then use standard indexing to select those columns:

import re
# Keep only the column names that the regex matches
matches = [c for c in df.columns if re.search(u'证券代码', c)]
print(df[matches])

Another way to get the matching columns is Python's built-in filter function:

colpattern = re.compile(u'证券代码')
matches = list(filter(colpattern.search, df.columns))
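The two variants above can be wrapped in a small helper that runs unchanged on Python 2 and Python 3. This is only a sketch: the name matching_labels is hypothetical, and it assumes that labels are either text (unicode) or objects whose text form is ASCII, such as integers. It avoids str(label), which is the call that fails on Python 2, and converts labels with u'{}'.format(label) instead.

```python
# -*- coding: utf-8 -*-
import re

def matching_labels(labels, pattern):
    """Return the labels whose text form matches `pattern` (via re.search)."""
    compiled = re.compile(pattern, re.UNICODE)
    # u'{}'.format(label) yields text on both Python 2 and 3, unlike
    # str(label), which on Python 2 tries to encode unicode as ASCII.
    return [label for label in labels
            if compiled.search(u'{}'.format(label))]

# Plain list standing in for df.columns (assumed example data)
columns = [u'ascii', u'证券代码', 2017]
print(matching_labels(columns, u'证券'))
```

With the DataFrame above, the selection would then be df[matching_labels(df.columns, u'证券代码')].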

2) DataFrame.select()

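Pass a matching function to .select(). This lets you use a regex or any other code to match the column names. Note that select() was deprecated in pandas 0.21 and removed in 1.0, so this option only applies to older pandas versions.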

import re
print(df.select(lambda c: re.search(u'证券代码', c), axis=1))

NOTE: For a regex as simple as this, you could use u'证券代码' in c as the criterion and not load the regex library at all.
