I have a dataset of tweets that is mainly in English but also contains some tweets in Indian languages (such as Punjabi, Hindi, Tamil, etc.). I want to keep only the English-language tweets and remove the rows with tweets in other languages. I tried this [https://stackoverflow.com/questions/67786493/pandas-dataframe-filter-out-rows-with-non-english-text] and it worked on the sample dataset. However, when I tried it on my own dataset, it raised an error:

LangDetectException: No features in text.

I have also checked another question [https://stackoverflow.com/questions/69804094/drop-non-english-rows-pandasand], where the accepted answer discusses this error and mentions that empty rows might be the cause, so I have already cleaned my dataset to remove all empty rows.

Simple code that worked on the sample data but not on the original data:

from langdetect import detect
import pandas as pd

df = pd.read_csv('Sample.csv')
df_new = df[df.text.apply(detect).eq('en')]
print('New df is: ', df_new) 

How can I check which row is producing the error?

Thanks in advance!

Piyush Ghasiya

1 Answer

Use a custom function that returns True if detect fails; filtering with it selects the rows that raise the error:

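from langdetect import detect
import pandas as pd

df = pd.read_csv('Sample.csv')

def f(x):
    # Return True when detect() raises for this value, False otherwise.
    try:
        detect(x)
        return False
    except Exception:
        return True

# Text of the rows where detection failed
s = df.loc[df.text.apply(f), 'text']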
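Rows can still fail after empty rows are removed. A likely cause (an assumption, not something confirmed in this thread) is text that contains no letters at all, such as digits, punctuation, or whitespace only. The sketch below uses a hypothetical helper, `has_letters`, to flag such values without calling langdetect; it is a heuristic pre-check, not a replacement for the function above.

```python
def has_letters(text):
    """Return True if text is a string with at least one alphabetic character.

    Hypothetical helper: assumes that values without any letters are the
    ones most likely to trigger "No features in text".
    """
    return isinstance(text, str) and any(ch.isalpha() for ch in text)


# Made-up sample values standing in for the 'text' column
samples = ["Good morning", "12345 !!!", "   ", "नमस्ते", None]

# Positions of values that contain no letters (or are not strings)
suspects = [i for i, t in enumerate(samples) if not has_letters(t)]
print(suspects)  # [1, 2, 4]
```

On the DataFrame from the question, something like `df[~df.text.apply(has_letters)]` should list the same kind of rows for inspection.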

Another idea is to create a new column filled with the output of detect, returning NaN where it fails. Then filter the rows with missing values into df1 (the rows that failed) and the rows where the new column equals 'en' into df_new:

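from langdetect import detect
import numpy as np
import pandas as pd

df = pd.read_csv('Sample.csv')

def f1(x):
    # Return the detected language code, or NaN if detect() raises.
    try:
        return detect(x)
    except Exception:
        return np.nan

df['new'] = df.text.apply(f1)

# Rows where detection failed
df1 = df[df.new.isna()]

# Rows detected as English
df_new = df[df.new.eq('en')]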
jezrael
  • Can you please tell me where I should place the above custom function in my code? Also, do I have to replace 'x' with 'text'? I am just a beginner. – Piyush Ghasiya Mar 11 '22 at 06:42
  • It is showing the same error: LangDetectException: No features in text. Another error is also showing: NameError: name 'LangDetectException' is not defined – Piyush Ghasiya Mar 11 '22 at 06:50
  • @PiyushGhasiya - do you think my code raises the error, or the code `df_new = df[df.text.apply(detect).eq('en')]`? – jezrael Mar 11 '22 at 06:51
  • The earlier code with LangDetectException raised the error. The present code (after the edits) is still running. Once it finishes, I can tell you whether it runs smoothly or not. – Piyush Ghasiya Mar 11 '22 at 07:01
  • @PiyushGhasiya - Is it a large DataFrame? – jezrael Mar 11 '22 at 07:47
  • @PiyushGhasiya - added another solution that runs the function only once; it returns `NaN` if an exception is raised – jezrael Mar 11 '22 at 08:23
  • Thank you for all the help. My dataset is very large (millions of rows) and it has been running for 2 hours. Is this normal, or is something wrong? Should I interrupt it and run again with the second solution? – Piyush Ghasiya Mar 11 '22 at 09:49
  • It worked. Apparently, because the dataset was huge, it took several hours, but it finally worked. I can't thank you enough. – Piyush Ghasiya Mar 12 '22 at 00:10
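Since the run took several hours on millions of rows, one possible way to reduce the work is to call the detector only once per distinct text, which may help if the dataset contains many repeated tweets. The sketch below is an illustration under that assumption: `detect_unique` is a hypothetical helper, and `fake_detector` is a stand-in so the example runs without langdetect installed.

```python
from typing import Callable, Dict, Iterable, List, Optional


def detect_unique(texts: Iterable[str],
                  detector: Callable[[str], str]) -> List[Optional[str]]:
    """Label every text, calling detector once per distinct value.

    A failed detection is stored as None instead of raising.
    """
    cache: Dict[str, Optional[str]] = {}
    labels: List[Optional[str]] = []
    for text in texts:
        if text not in cache:
            try:
                cache[text] = detector(text)
            except Exception:
                cache[text] = None
        labels.append(cache[text])
    return labels


calls = []

def fake_detector(text):
    # Stand-in for langdetect.detect: records each call and fails on
    # text without letters.
    calls.append(text)
    if not any(ch.isalpha() for ch in text):
        raise ValueError("No features in text.")
    return "en"


labels = detect_unique(["hello", "hello", "123", "hello"], fake_detector)
print(labels)      # ['en', 'en', None, 'en']
print(len(calls))  # 2
```

With the real library, passing `detect` as the detector and assigning the result with `df['new'] = detect_unique(df.text, detect)` should give a column comparable to the one built by `f1`, with None in place of NaN for failed rows.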