I am trying to remove all Chinese characters from a CSV file that contains both Latin and Chinese characters. The data looks like:

    address                                                 lat
1   农工商超市, Zhangjiang, Pudong New District, 203718       31.204024
2   欧尚, 3057号, Jinke Road, Pudong, 201203, China          31.181804

I need it to look like:

    address                                                 lat
1   , Zhangjiang, Pudong New District, 203718               31.204024
2   , 3057, Jinke Road, Pudong, 201203, China               31.181804

I tried df.replace(/[^\x00-\x7F]/g, "") and df.replace(/[\u{0080}-\u{FFFF}]/gu,""), but I get an error:

    df1.replace([^\x00-\x7F],"");
                 ^
SyntaxError: invalid syntax

Need help! Thanks.

niraj

3 Answers

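You were almost there. The /.../g form is JavaScript regex syntax; in Python, pass the pattern as a raw string. On pandas 2.0 and later, also pass regex=True, since str.replace treats the pattern as a literal string by default:

df['address'] = df['address'].str.replace(r'[^\x00-\x7F]+', '', regex=True)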

result:

In [99]: df
Out[99]:
                                     address        lat
0  , Zhangjiang, Pudong New District, 203718  31.204024
1  , 3057, Jinke Road, Pudong, 201203, China  31.181804
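
For anyone who wants to reproduce this end to end, here is a minimal sketch that rebuilds the two sample rows from the question and applies the same pattern. The DataFrame construction is my own assumption about how the data is loaded. regex=True is passed explicitly because pandas 2.0 and later treat the pattern as a literal string by default.

```python
import pandas as pd

# Sample rows copied from the question; building the frame by hand is
# only for illustration, a real workflow would use pd.read_csv.
df = pd.DataFrame({
    "address": [
        "农工商超市, Zhangjiang, Pudong New District, 203718",
        "欧尚, 3057号, Jinke Road, Pudong, 201203, China",
    ],
    "lat": [31.204024, 31.181804],
})

# Remove every run of characters outside the 7-bit ASCII range.
# regex=True is required on pandas >= 2.0, where the default is a literal match.
df["address"] = df["address"].str.replace(r"[^\x00-\x7F]+", "", regex=True)

print(df["address"].tolist())
```

Note that this pattern removes anything non-ASCII, so accented Latin letters would be stripped along with the Chinese characters.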
MaxU - stand with Ukraine

Another way is to use filter with string.printable, similar to the linked answer:

import string
printable = set(string.printable)
df['address'] = df['address'].apply(lambda row: ''.join(filter(lambda x: x in printable, row)))
df

Result:

                                    address        lat
1  , Zhangjiang, Pudong New District, 203718  31.204024
2  , 3057, Jinke Road, Pudong, 201203, China  31.181804

Or use encode and decode with a lambda, similar to the linked answer:

df['address'] = df['address'].apply(lambda row: row.encode('ascii',errors='ignore').decode())
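
The two approaches above are not quite equivalent, which a small plain-Python sketch can show. The helper names keep_printable and keep_ascii are hypothetical, made up for this illustration; either one could be passed to df['address'].apply.

```python
import string

PRINTABLE = set(string.printable)

def keep_printable(text):
    """Keep only characters found in string.printable."""
    return "".join(ch for ch in text if ch in PRINTABLE)

def keep_ascii(text):
    """Drop every character that cannot be encoded as ASCII."""
    return text.encode("ascii", errors="ignore").decode("ascii")

sample = "欧尚, 3057号, Jinke Road"
print(keep_printable(sample))  # , 3057, Jinke Road
print(keep_ascii(sample))      # , 3057, Jinke Road

# They differ on ASCII control characters: string.printable excludes
# a NUL byte, while the ASCII codec keeps it.
print(repr(keep_printable("a\x00b")))  # 'ab'
print(repr(keep_ascii("a\x00b")))      # 'a\x00b'

# Both also drop accented Latin letters, which may not be wanted.
print(keep_ascii("Café"))  # Caf
```

So string.printable is the stricter filter, and neither is specific to Chinese characters.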
niraj

An arguably more robust way to do this, if you want to limit your character set, is to open the file with the encoding you want while ignoring decoding errors, then pass the file object to pandas:

import pandas as pd

# Bytes that cannot be decoded as ASCII are dropped silently
with open('your_csv_file.csv', encoding='ascii', errors='ignore') as infile:
    df = pd.read_csv(infile)
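
To see what the file-level decoding does on its own, here is a rough sketch that uses only the standard library. The temporary file and its single data row are made up for illustration, and csv.reader stands in for pd.read_csv so the example has no third-party dependency.

```python
import csv
import os
import tempfile

# Write a small UTF-8 CSV to a temporary file (hypothetical sample data).
with tempfile.NamedTemporaryFile(
    "w", suffix=".csv", encoding="utf-8", newline="", delete=False
) as tmp:
    tmp.write("address,lat\n")
    tmp.write('"欧尚, 3057号, Jinke Road",31.181804\n')
    path = tmp.name

# Reopen the file as ASCII; bytes that cannot be decoded are dropped silently.
with open(path, encoding="ascii", errors="ignore", newline="") as infile:
    rows = list(csv.reader(infile))

os.remove(path)
print(rows)
```

The stripping happens for every column in the file, not just address, so this fits best when the whole file should be limited to ASCII.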
Will Ayd