Removing rows that contain non-English words in a Pandas DataFrame

Question

I have a pandas data frame that consists of 4 rows. The English rows contain news titles, but some rows contain non-English words, like this one:

**SheÃ¢â‚¬â„¢s the Hollywood Power Behind Those ...**

I want to remove all rows like this one, i.e. all rows in the Pandas data frame that contain at least one non-English character.

Non-English characters, non-(base)-ASCII characters, or non-Latin characters? By ‘characters’, I presume you mean letters/digits? Please provide an example of the DataFrame, and the expected result. Thank you. — S3DEV, Nov 25 '20 at 21:02
You *might* find the `string.ascii_letters` and `string.digits` constants helpful here. — S3DEV, Nov 25 '20 at 21:07
Does this answer your question? [How to check if string is 100% ascii in python 3](https://stackoverflow.com/questions/33004065/how-to-check-if-string-is-100-ascii-in-python-3) — Bill Huang, Nov 25 '20 at 21:15

Cainã Max Couto-Silva · Accepted Answer · 2020-11-25T21:43:46.620

If using Python >= 3.7:

df[df['col'].map(lambda x: x.isascii())]

where col is your target column.

Data:

import pandas as pd

df = pd.DataFrame({
    'colA': ['**SheÃ¢â‚¬â„¢s the Hollywood Power Behind Those ...**',
             'Hello, world!', 'Cainã', 'another value', 'test123*', 'âbc']
})

print(df.to_markdown())

|    | colA                                                  |
|---:|:------------------------------------------------------|
|  0 | **SheÃ¢â‚¬â„¢s the Hollywood Power Behind Those ...** |
|  1 | Hello, world!                                         |
|  2 | Cainã                                                 |
|  3 | another value                                         |
|  4 | test123*                                              |
|  5 | âbc                                                   |

Identifying and filtering out strings with non-English characters, i.e. keeping only the rows where every character is ASCII (see the ASCII printable characters):

df[df.colA.map(lambda x: x.isascii())]

Output:

            colA
1  Hello, world!
3  another value
4       test123*
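
The filter above calls `x.isascii()` on every value, so it would raise an `AttributeError` if the column held a missing value or a number. A minimal sketch of a more defensive mask follows; the sample data with `None` is hypothetical, and treating non-string values as rows to drop is an assumption rather than something the question specifies.

```python
import pandas as pd

# Hypothetical sample data; None stands in for a missing title.
df = pd.DataFrame({
    'colA': ['Hello, world!', 'Cainã', None, 'test123*', 'âbc']
})

# Keep a row only if the value is a string and every character is ASCII.
# Assumption: non-string values (None, NaN, numbers) are dropped as well.
mask = df['colA'].map(lambda x: isinstance(x, str) and x.isascii()).astype(bool)
ascii_only = df[mask]

print(ascii_only)
```

If missing values should be kept instead, the `isinstance` check can be inverted so that non-strings map to `True`.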

My original approach, which also works on Python versions older than 3.7, was to use a user-defined function like this:

def is_ascii(s):
    try:
        s.encode(encoding='utf-8').decode('ascii')
    except UnicodeDecodeError:
        return False
    else:
        return True
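
For completeness, here is one way the helper could be applied to a column. The function is repeated so the snippet runs on its own, and the small data frame is only illustrative.

```python
import pandas as pd

def is_ascii(s):
    # Encoding to UTF-8 and decoding as ASCII fails as soon as
    # one non-ASCII byte is present in the string.
    try:
        s.encode(encoding='utf-8').decode('ascii')
    except UnicodeDecodeError:
        return False
    else:
        return True

# Illustrative data, a subset of the sample used earlier.
df = pd.DataFrame({'colA': ['Hello, world!', 'Cainã', 'another value', 'âbc']})

# Series.map calls is_ascii once per value and returns a boolean mask.
filtered = df[df['colA'].map(is_ascii)]

print(filtered)
```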

Thx @S3DEV. Updated! — Cainã Max Couto-Silva, Nov 25 '20 at 21:29

score 1 · Answer 2 · answered Nov 25 '20 at 21:07

You can use regex to do that.

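The `re` module is part of Python's standard library, so nothing needs to be installed (`pip install regex` installs a separate third-party package that is not required here).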

import re

and use the pattern [^a-zA-Z] to filter it.

To break it down:

- `^` (inside the brackets): not
- `a-z`: lowercase letters
- `A-Z`: capital letters
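
As a rough sketch of how a regex filter might look on a column: note that `[^a-zA-Z]` also matches spaces, digits, and punctuation, so on the sample titles it flags every row. The `[^\x00-\x7F]` pattern below is an assumed alternative that matches only characters outside the ASCII range; the names `NON_LETTER` and `NON_ASCII` are made up for this illustration.

```python
import re
import pandas as pd

# Pattern from the answer: anything that is not an unaccented Latin letter.
NON_LETTER = re.compile(r'[^a-zA-Z]')
# Assumed alternative: anything outside the ASCII range.
NON_ASCII = re.compile(r'[^\x00-\x7F]')

df = pd.DataFrame({
    'colA': ['Hello, world!', 'Cainã', 'another value', 'test123*', 'âbc']
})

# True for every row here, because spaces, digits, punctuation and
# accented letters all count as "not a-z or A-Z".
has_non_letter = df['colA'].map(lambda s: bool(NON_LETTER.search(s))).astype(bool)

# True only for rows containing at least one non-ASCII character.
has_non_ascii = df['colA'].map(lambda s: bool(NON_ASCII.search(s))).astype(bool)

filtered = df[~has_non_ascii]
print(filtered)
```

Accented letters such as 'ã' fall outside `a-z`, so both patterns flag them; the difference is only in how plain punctuation and whitespace are treated.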

Yonus

I’d recommend checking to ensure these patterns (and regex in general) exclude non-(base)-Latin characters such as 'ä', and the like. (Past experience nudges me that they don’t ...). Especially if OP wants to stick to the base ASCII table. (Unclear at present.) – S3DEV Nov 25 '20 at 21:10
