
I have a dataset with 50 thousand rows of text. I want to find the rows that contain words with accented characters and print the top 10 of them.

I found this solution, but I am still not able to get it working. I'm still new to pandas.

import pandas as pd

df = pd.DataFrame([["I love reading book"],
     ["I'm going to café at 3pm"],
     ["A façade is exterior of building"]],
     columns=['text'])

Expected output:

     ["I'm going to café at 3pm"], 
     ["A façade is exterior of building"]
Jimmy
  • Please post 1) sample input data 2) include the code you have tried (not just a link but the code and the error you are receiving) 3) expected output data. Please do not post images as well and you can see here for more detail on how to do all of this: https://stackoverflow.com/questions/20109391/how-to-make-good-reproducible-pandas-examples – David Erickson Oct 07 '20 at 03:36
  • What is your expected output based on the `df` you shared? – Mayank Porwal Oct 07 '20 at 04:08

3 Answers


Try this, replacing 'Col_name' with the name of your text column: df[df['Col_name'].str.contains(r'É|é|Á|á|ó|Ó|ú|Ú|í|Í')].head(10)


Subasri sridhar

Encode the strings using ASCII encoding. The "accented" characters are not ASCII characters. When you attempt to encode them, you must choose whether to ignore them or replace them with question marks. If the two methods give different results, then the original string has non-ASCII characters:

accented = df.text.str.encode('ascii', errors='ignore') != \
           df.text.str.encode('ascii', errors='replace')

You can use this boolean mask to extract up to the first ten rows with non-ASCII characters:

df[accented].iloc[:10]
#                                text
# 1          I'm going to café at 3pm
# 2  A façade is exterior of building

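In Python 3.7+, you can use the built-in str.isascii to the same effect. The .str accessor may not provide isascii (older pandas versions raise AttributeError), so apply it element-wise with map:

accented = ~df.text.map(str.isascii) # ~ is negation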
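For reference, here is a self-contained sketch of the non-ASCII check on the question's sample frame. The helper name `non_ascii_mask` is made up for illustration, and the sketch assumes the column holds only strings (no missing values). It also builds the same mask by mapping Python's built-in `str.isascii` over the column, which does not depend on what the pandas `.str` accessor offers.

```python
import pandas as pd


def non_ascii_mask(series: pd.Series) -> pd.Series:
    """Return True for rows whose text contains at least one non-ASCII character.

    Hypothetical helper; assumes every value in the series is a str.
    """
    # 'ignore' drops characters that cannot be encoded, 'replace' turns them into '?'.
    # The two byte strings differ only when such a character was present.
    ignored = series.str.encode("ascii", errors="ignore")
    replaced = series.str.encode("ascii", errors="replace")
    return ignored != replaced


df = pd.DataFrame(
    {
        "text": [
            "I love reading book",
            "I'm going to café at 3pm",
            "A façade is exterior of building",
        ]
    }
)

mask = non_ascii_mask(df["text"])

# Same idea with the built-in str.isascii (Python 3.7+), applied element-wise.
mask_alt = ~df["text"].map(str.isascii).astype(bool)

# Up to the first ten rows that contain a non-ASCII character.
print(df[mask].head(10))
```

Both masks flag any non-ASCII character, so symbols such as currency signs or emoji are matched as well as accented letters.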
DYZ
  • what if someone only wanted to return specific accented characters and did not want to return `façade`? – David Erickson Oct 07 '20 at 05:06
  • @DavidErickson That would be a different problem with a different solution. My answer addresses the OP. – DYZ Oct 07 '20 at 05:08
  • not necessarily :) as the OP has failed to provide expected output. Not trying to start a debate, but I'm just pointing out that either solution could be correct. At the end of the day, it depends on what the expected output would be (which doesn't exist). – David Erickson Oct 07 '20 at 05:09
  • I am using Python 3.9, but I am facing this error: AttributeError: 'StringMethods' object has no attribute 'isascii' – asif abdullah Jan 11 '22 at 04:44

Different languages have different accented characters. If you are only looking for Spanish characters, for example, you can list them in the pattern passed to str.contains(), separating each character with |. Then filter your dataframe with the result and use .head(10):

import pandas as pd
df = pd.DataFrame(["I love reading book",
     "I'm going to café at 3pm", 
     "A façade is exterior of building"], 
     columns=['text'])
df[df['text'].str.contains('É|é|Á|á|ó|Ó|ú|Ú|í|Í')].head(10)
Out[1]: 
                       text
1  I'm going to café at 3pm

Obviously, there are only a few rows in the sample data, but with more rows this would print the first 10 occurrences.
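If hard-coding a character list per language is impractical, one alternative (a different technique from the pattern above, shown only as a sketch) is to decompose each string with Unicode normalization and look for combining marks. The helper name `has_accent` is made up for illustration, and the sketch assumes the column holds only strings.

```python
import unicodedata

import pandas as pd


def has_accent(text: str) -> bool:
    """Return True if the text contains a letter carrying a combining mark.

    Hypothetical helper: NFD normalization splits a character such as 'é'
    into the base letter 'e' plus a combining acute accent, and
    unicodedata.combining() returns a non-zero class for such marks.
    """
    decomposed = unicodedata.normalize("NFD", text)
    return any(unicodedata.combining(ch) for ch in decomposed)


df = pd.DataFrame(
    ["I love reading book",
     "I'm going to café at 3pm",
     "A façade is exterior of building"],
    columns=["text"],
)

# Element-wise check; astype(bool) keeps the mask boolean for indexing.
accent_mask = df["text"].map(has_accent).astype(bool)
print(df[accent_mask].head(10))
```

Unlike an explicit character list, this also matches 'ç' in "façade", so whether it fits depends on which rows are actually wanted.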

David Erickson