
I have a pandas DataFrame that consists of 4 rows. The English rows contain news titles, but some rows contain non-English characters, like this one:

**She’s the Hollywood Power Behind Those ...**

I want to remove all rows like this one, i.e. every row in the DataFrame that contains at least one non-English character.

Joe Ferndz
Omar
  • Non-English characters, non-(base)-ASCII characters, or non-Latin characters? By ‘characters’, I presume you mean letters/digits? Please provide an example of the DataFrame, and the expected result. Thank you. – S3DEV Nov 25 '20 at 21:02
  • You *might* find the `string.ascii_letters` and `string.digits` properties helpful here. – S3DEV Nov 25 '20 at 21:07
  • Does this answer your question? [How to check if string is 100% ascii in python 3](https://stackoverflow.com/questions/33004065/how-to-check-if-string-is-100-ascii-in-python-3) – Bill Huang Nov 25 '20 at 21:15

2 Answers


If using Python >= 3.7:

df[df['col'].map(lambda x: x.isascii())]

where col is your target column.


Data:

import pandas as pd

df = pd.DataFrame({
    'colA': ['**She’s the Hollywood Power Behind Those ...**', 
             'Hello, world!', 'Cainã', 'another value', 'test123*', 'âbc']
})

print(df.to_markdown())  # to_markdown() needs the optional `tabulate` package
|    | colA                                                  |
|---:|:------------------------------------------------------|
|  0 | **She’s the Hollywood Power Behind Those ...** |
|  1 | Hello, world!                                         |
|  2 | Cainã                                                 |
|  3 | another value                                         |
|  4 | test123*                                              |
|  5 | âbc                                                   |

Keeping only the rows whose strings are pure ASCII, which drops every string with a non-English character (see the ASCII printable characters):

df[df.colA.map(lambda x: x.isascii())]

Output:

            colA
1  Hello, world!
3  another value
4       test123*
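
One caveat with the `map` approach: `isascii` is a string method, so a missing value (`None` or `NaN`) in the column would raise `AttributeError`. A minimal sketch of a guard, assuming rows with missing or non-string values should be dropped as well (the `None` row below is made up for illustration):

```python
import pandas as pd

df = pd.DataFrame({
    'colA': ['She’s the Hollywood Power', 'Hello, world!', 'Cainã', None, 'test123*']
})

# Anything that is not a str (None, NaN, numbers) is treated as "not ASCII",
# so the lambda never calls .isascii() on a non-string value.
mask = df['colA'].map(lambda x: isinstance(x, str) and x.isascii()).astype(bool)

ascii_only = df[mask]
print(ascii_only)
```

If missing values should be kept instead, the lambda could return `True` for them; which behaviour is right depends on the data.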

My original approach, which also works on Python versions before 3.7, was a user-defined function like this:

def is_ascii(s):
    try:
        s.encode(encoding='utf-8').decode('ascii')
    except UnicodeDecodeError:
        return False
    else:
        return True
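
A short sketch of how this function could be plugged into the same filter, assuming a column named `colA` as in the example data (the function is repeated so the snippet runs on its own):

```python
import pandas as pd

def is_ascii(s):
    # Encoding to UTF-8 and decoding as ASCII fails for any non-ASCII character.
    try:
        s.encode(encoding='utf-8').decode('ascii')
    except UnicodeDecodeError:
        return False
    else:
        return True

df = pd.DataFrame({'colA': ['Hello, world!', 'Cainã', 'another value', 'âbc']})

# Keep only the rows for which is_ascii returns True.
filtered = df[df['colA'].map(is_ascii)]
print(filtered)
```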
Cainã Max Couto-Silva

You can use regex to do that.

No installation is needed: `re` is part of Python's standard library. (`pip install regex` installs a different, third-party package.)

import re

and use the pattern [^a-zA-Z] to filter it.

To break it down:
  • ^ (at the start of the brackets): not
  • a-z: lowercase letters
  • A-Z: capital letters
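
Note that `[^a-zA-Z]` also matches spaces, digits and punctuation, so dropping every row that matches it would remove ordinary English titles too. A sketch of a narrower pattern, assuming "non-English" means "outside the 7-bit ASCII range"; the column name `colA` is borrowed from the other answer:

```python
import pandas as pd

df = pd.DataFrame({
    'colA': ['She’s the Hollywood Power', 'Hello, world!', 'Cainã', 'test123*', 'âbc']
})

# [^\x00-\x7F] matches any single character outside the ASCII range,
# so a row is flagged as soon as it contains one such character.
has_non_ascii = df['colA'].str.contains(r'[^\x00-\x7F]', regex=True)

# Invert the mask to keep only the pure-ASCII rows.
ascii_only = df[~has_non_ascii]
print(ascii_only)
```

Whether accented Latin letters such as 'ã' should count as non-English is a judgement call; this pattern treats them as non-ASCII and drops those rows.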

Yonus
  • I’d recommend checking to ensure these patterns (and regex in general) exclude non-(base)-Latin characters such as 'ä', and the like. (Past experience nudges me that they don’t ...). Especially if OP wants to stick to the base ASCII table. (Unclear at present.) – S3DEV Nov 25 '20 at 21:10