Given this dataframe (which is a subset of mine):
username | user_message |
---|---|
Polop | I love this picture, which is very beautiful |
Artil | Meh |
Artingo | Es un cuadro preciosa, me recuerda a mi infancia. |
Zona | I like it |
Soi | Yuck, to say I hate it would be a euphemism |
Iyu | NaN |
What I'm trying to do is drop the rows whose message has fewer than 5 words (tokens) or is not written in English. I'm not familiar with pandas, so I came up with this not-so-pretty solution:
import pandas as pd
from langdetect import detect
index = 0
index_list = []
for review in df["user_message"]:
    count = 0
    if str(review) == "NaN":
        index_list.append(index)
        continue
    for i in review:
        if(i.isspace()):
            count=count+1
    if len(review) == 0:
        index_list.append(index)
    elif review.isspace() is True:
        index_list.append(index)
    elif count < 5:
        index_list.append(index)
    else:
        try:
            detect(review)
            if detect(review) != "en":
                index_list.append(index)
            else:
                pass
        except:
            pass
    index = index + 1
df = df.drop(index_list, axis = 0).reset_index(drop = True)
This solution apparently does not work (blank rows and rows with only one word remain in my dataframe), and I'm sure there is a more efficient, faster method. Do you have any idea how to tackle this issue?
Thank you.
EDIT: I finally got it to work, thanks to @ansev's answer. Since TextBlob raises an error if too many requests are sent, I relied on the langdetect module instead. Here is the corresponding code:
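# m1: messages with at least 5 space-separated tokens (NaN compares as False)
m1 = df['user_message'].str.split(' ').str.len() >= 5
# m2: messages made up only of whitespace (NaN stays NaN)
m2 = df['user_message'].str.isspace()
# parentheses are needed here: & and | bind tighter than ==
df_filtered = df.loc[m1 & (m2 == False)].reset_index(drop=True)
# a conditional expression needs an else branch; None never equals 'en'
m3 = df_filtered['user_message'].astype(str).apply(lambda x: detect(x) if len(x) >= 5 else None).eq('en')
df_filtered = df_filtered.loc[m3].reset_index(drop=True)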
I had to compute m3 separately, since detect raises an error if it cannot identify the text. This is often caused by strings that contain only whitespace, which is why I added the m2 condition: it checks whether a cell contains only whitespace (returning True if that is the case).
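For what it's worth, the same checks can also be sketched as one plain Python function that guards the detector call, so a failing detection does not interrupt the filtering. This is only a minimal sketch under a few assumptions: `keep_message`, `min_words`, `lang` and `fake_detect` are names made up for illustration, `fake_detect` is a stand-in so the snippet runs without langdetect installed, and a message whose language cannot be detected is dropped (unlike my original loop, which kept it).

```python
import math


def keep_message(text, detector, min_words=5, lang="en"):
    """Return True if text has at least min_words words and detector reports lang.

    Hypothetical helper: `detector` is any callable that takes a string and
    returns a language code (for example langdetect's `detect`).
    """
    # Missing values: None or a float NaN, which is how pandas marks empty cells
    if text is None or (isinstance(text, float) and math.isnan(text)):
        return False
    # split() without arguments ignores runs of whitespace, so a
    # whitespace-only string yields zero words
    words = str(text).split()
    if len(words) < min_words:
        return False
    try:
        return detector(str(text)) == lang
    except Exception:
        # Assumption: treat "language could not be detected" as "drop the row"
        return False


def fake_detect(text):
    """Stand-in detector for demonstration only (not langdetect)."""
    if not any(ch.isalpha() for ch in text):
        raise ValueError("no features in text")
    return "es" if "cuadro" in text else "en"


messages = [
    "I love this picture, which is very beautiful",
    "Meh",
    "Es un cuadro preciosa, me recuerda a mi infancia.",
    "I like it",
    "Yuck, to say I hate it would be a euphemism",
    float("nan"),
    "      ",
]

kept = [m for m in messages if keep_message(m, fake_detect)]
print(kept)
```

With pandas and langdetect available, something like `df[df['user_message'].apply(lambda t: keep_message(t, detect))]` should apply the same function to the column, though I have only sketched that part here.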