How to find accented character in pandas by row

Question

I have a dataset with 50 thousand rows of text. I want to find which rows contain words with accented characters and print the first 10 of those rows.

I found this solution but am still not able to do it. I'm still new to pandas.

import pandas as pd

df = pd.DataFrame([["I love reading book"],
     ["I'm going to café at 3pm"],
     ["A façade is exterior of building"]],
     columns=['text'])

Expected output:

     ["I'm going to café at 3pm"], 
     ["A façade is exterior of building"]

Please post 1) sample input data 2) include the code you have tried (not just a link but the code and the error you are receiving) 3) expected output data. Please do not post images as well and you can see here for more detail on how to do all of this: https://stackoverflow.com/questions/20109391/how-to-make-good-reproducible-pandas-examples — David Erickson, Oct 07 '20 at 03:36

Subasri sridhar · Answer 1 · 2020-10-07T04:41:17.460

1

Try this, replacing Col_name with the name of your text column (text in the question's sample data): df[df['Col_name'].str.contains(r'É|é|Á|á|ó|Ó|ú|Ú|í|Í')].head(10)
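A minimal sketch of the same idea, assuming the column is named text as in the question. A regex character class matches the same letters as the alternation and is a little easier to extend, and na=False makes missing values count as non-matches instead of putting NaN into the mask. The names pattern, mask and result are illustrative only.

```python
import pandas as pd

df = pd.DataFrame({"text": ["I love reading book",
                            "I'm going to café at 3pm",
                            "A façade is exterior of building",
                            None]})

# Character class listing the same letters as the alternation above.
pattern = r"[ÉéÁáÓóÚúÍí]"

# na=False treats missing values as "no match", so the mask is purely boolean.
mask = df["text"].str.contains(pattern, regex=True, na=False)

# Keep at most the first ten matching rows.
result = df[mask].head(10)
print(result)
```

Because ç is not in the list, the façade row is not selected here; add further letters to the class as needed.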


edited Oct 07 '20 at 04:41

answered Oct 07 '20 at 03:45

Subasri sridhar


score 1 · Accepted Answer · answered Oct 07 '20 at 04:46

1

Encode the strings using ASCII encoding. The "accented" characters are not ASCII characters. When you attempt to encode them, you must choose whether to ignore them or replace them with question marks. If the two methods give different results, then the original string has non-ASCII characters:

accented = df.text.str.encode('ascii', errors='ignore') != \
           df.text.str.encode('ascii', errors='replace')

You can use this boolean mask to extract up to the first ten rows with non-ASCII characters:

df[accented].iloc[:10]
#                               text
#1          I'm going to café at 3pm
#2  A façade is exterior of building
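To see why the comparison works, here is a small sketch on plain Python strings (the sample values come from the question; the names samples and flags are illustrative). With errors='ignore' a non-ASCII character is dropped, with errors='replace' it becomes a question mark, so the two byte strings differ only when something non-ASCII was present.

```python
samples = ["I love reading book", "I'm going to café at 3pm"]

flags = []
for s in samples:
    ignored = s.encode("ascii", errors="ignore")    # non-ASCII characters are dropped
    replaced = s.encode("ascii", errors="replace")  # non-ASCII characters become b"?"
    flags.append(ignored != replaced)               # True only if something was non-ASCII
    print(ignored, replaced, ignored != replaced)
```

Note that this flags any non-ASCII character (for example, curly quotes or emoji), not only accented letters.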

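In Python 3.7+, you can apply the built-in string method isascii to each value to the same effect. Note that df.text.str.isascii() raises AttributeError on pandas versions whose .str accessor does not provide isascii, so map the method over the column instead:

accented = ~df.text.map(str.isascii) # ~ is negation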
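If the column may contain missing values, calling the string method isascii on every element fails for the non-string entries. A hedged sketch of one way to guard against that, using only Python's built-in str.isascii (Python 3.7+); the helper name has_non_ascii is hypothetical:

```python
import pandas as pd

def has_non_ascii(value):
    """Return True only for strings that contain a non-ASCII character."""
    return isinstance(value, str) and not value.isascii()

df = pd.DataFrame({"text": ["I love reading book",
                            "I'm going to café at 3pm",
                            None]})

# Missing values are not strings, so they map to False.
accented = df["text"].map(has_non_ascii).astype(bool)

# Up to the first ten rows with non-ASCII characters.
print(df[accented].iloc[:10])
```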

answered Oct 07 '20 at 04:46

DYZ


what if someone only wanted to return specific accented characters and did not want to return `façade`? – David Erickson Oct 07 '20 at 05:06
@DavidErickson That would be a different problem with a different solution. My answer addresses the OP. – DYZ Oct 07 '20 at 05:08
not necessarily :) as the OP has failed to provide expected output. Not trying to start a debate, but I'm just pointing out that either solution could be correct. At the end of the day, it depends on what the expected output would be (which doesn't exist). – David Erickson Oct 07 '20 at 05:09
I am using python 3.9. But i am facing this error:: AttributeError: 'StringMethods' object has no attribute 'isascii' – asif abdullah Jan 11 '22 at 04:44

score 0 · Answer 3 · answered Oct 07 '20 at 04:26

Different languages have different accented characters. If you are just looking for Spanish characters, for example, you can separate the characters with | in str.contains(), filter your dataframe with the resulting mask, and use .head(10):

import pandas as pd
df = pd.DataFrame(["I love reading book",
     "I'm going to café at 3pm", 
     "A façade is exterior of building"], 
     columns=['text'])
df[df['text'].str.contains('É|é|Á|á|ó|Ó|ú|Ú|í|Í')].head(10)
Out[1]: 
                       text
1  I'm going to café at 3pm

Obviously, there are only a few rows in the sample data, but with more rows this would print the first 10 occurrences.
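Since the set of accented letters depends on the language, one alternative (not part of this answer, and only a sketch) is the standard-library unicodedata module: decompose each string with NFD normalization and look for combining marks. This catches é and ç alike while ignoring non-ASCII characters that carry no diacritic, such as the euro sign. The helper name has_diacritic and the extra sample row are assumptions made for illustration.

```python
import unicodedata

import pandas as pd

def has_diacritic(value):
    """Return True if the string contains a letter with a combining mark."""
    if not isinstance(value, str):
        return False
    # NFD splits "é" into "e" plus a combining acute accent.
    decomposed = unicodedata.normalize("NFD", value)
    # combining() returns a non-zero class for combining marks.
    return any(unicodedata.combining(ch) for ch in decomposed)

df = pd.DataFrame({"text": ["I love reading book",
                            "I'm going to café at 3pm",
                            "A façade is exterior of building",
                            "The book costs 5 €"]})

mask = df["text"].map(has_diacritic)
print(df[mask].head(10))
```

Letters that have no canonical decomposition, such as ø, are not detected by this check.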

@DYZ not necessarily. "Different languages will have different accented characters. If you are just looking for spanish characters for example" — David Erickson, Oct 07 '20 at 05:05
