Given this dataframe (which is a subset of mine):

username  user_message
Polop     I love this picture, which is very beautiful
Artil     Meh
Artingo   Es un cuadro preciosa, me recuerda a mi infancia.
Zona      I like it
Soi       Yuck, to say I hate it would be a euphemism
Iyu       NaN

I want to drop rows whose message has fewer than 5 words (tokens), and rows that are not written in English. I'm not familiar with pandas, so I came up with a not-so-pretty solution:

import pandas as pd
from langdetect import detect
index = 0
index_list = []
for review in df["user_message"]:
    count = 0
    if str(review) == "NaN":
        index_list.append(index)
        continue
    for i in review:
        if(i.isspace()):
            count=count+1
    if len(review) == 0:
        index_list.append(index)
    elif review.isspace() is True:
        index_list.append(index)
    elif count < 5:
        index_list.append(index)
    else:
        try:
            detect(review)
            if detect(review) != "en":
                index_list.append(index)
            else:
                pass
        except:
            pass
    index = index + 1
df = df.drop(index_list, axis = 0).reset_index(drop = True)

This solution apparently does not work (blank rows and rows with only one word remain in my dataframe), and I'm sure there is a faster, more efficient method. Do you have any idea how to tackle this issue?

Thank you.

EDIT: So I finally got it to work, thanks to the answer of @ansev. Since TextBlob raises an error if too many requests are sent, I relied on the langdetect module. Here is the corresponding code:

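# Keep messages with more than 5 space-separated tokens
m1 = df['user_message'].str.split(' ').str.len() > 5
# True for whitespace-only cells; .eq(True) turns NaN into False
m2 = df['user_message'].str.isspace().eq(True)
df_filtered = df.loc[m1 & ~m2].reset_index(drop=True)
# The lambda needs an else branch; very short strings map to ''
m3 = df_filtered['user_message'].astype(str).apply(lambda x: detect(x) if len(x) >= 5 else '').eq('en')
df_filtered = df_filtered.loc[m3].reset_index(drop=True)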

I had to compute m3 separately because detect raises an error when it cannot identify the language of the text. This is often caused by strings that contain only whitespace, which is why I added the m2 condition: it returns True for cells that contain only whitespace.
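
For reference, here is a minimal sketch of the same idea with the NaN, whitespace and detector-error handling folded into one function. The names filter_messages, safe_detect and toy_detect are hypothetical and only for illustration; toy_detect is a placeholder so the example runs without any language-detection package, and a real detector would be passed in its place. Treat it as one possible arrangement under those assumptions, not a definitive implementation.

```python
import pandas as pd


def filter_messages(df, detect_fn, min_words=5, lang="en"):
    """Keep rows with at least `min_words` words whose detected language is `lang`."""
    # NaN becomes '', and whitespace-only cells collapse to '' after strip()
    text = df["user_message"].fillna("").astype(str).str.strip()
    enough_words = text.str.split().str.len() >= min_words

    def safe_detect(message):
        # Treat any detector failure as "language unknown"
        try:
            return detect_fn(message)
        except Exception:
            return ""

    # Run the (slow) detector only on rows that passed the word-count check
    detected = text[enough_words].apply(safe_detect)
    is_lang = detected.eq(lang).reindex(df.index, fill_value=False).astype(bool)
    return df.loc[enough_words & is_lang].reset_index(drop=True)


def toy_detect(message):
    # Hypothetical stand-in for a real detector; raises on empty input
    if not message:
        raise ValueError("no text")
    return "es" if " un " in f" {message.lower()} " else "en"


sample = pd.DataFrame({
    "username": ["Polop", "Artil", "Artingo", "Zona", "Soi", "Iyu"],
    "user_message": [
        "I love this picture, which is very beautiful",
        "Meh",
        "Es un cuadro preciosa, me recuerda a mi infancia.",
        "I like it",
        "Yuck, to say I hate it would be a euphemism",
        float("nan"),
    ],
})
result = filter_messages(sample, toy_detect)
print(result["username"].tolist())
```

With langdetect installed, passing its detect function as detect_fn (filter_messages(df, detect)) should behave the same way, since any exception it raises is caught in safe_detect. Filtering on word count first also means the detector is called on fewer rows, which matters for a dataframe of about 1 million rows.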

1 Answer

Use:

from textblob import TextBlob
m1 = df['user_message'].astype(str).apply(lambda x: TextBlob(x).detect_language() 
                                          if len(x) >= 3 else '').eq('en') 
m2 = df['user_message'].str.split(' ').str.len() > 5
df_filtered = df.loc[m1 | m2]
print(df_filtered)

  username                                       user_message
0    Polop       I love this picture, which is very beautiful
2  Artingo  Es un cuadro preciosa, me recuerda a mi infancia.
3     Zona                                          I like it
4      Soi        Yuck, to say I hate it would be a euphemism

If TextBlob is not installed, see: No Module named textblob

ansev
  • Nice. Don't forget to drop NA values, e.g. `Iyu NaN` in the original example. – Nick ODell Jan 23 '21 at 22:29
  • So, I think it should work, but only for dataframes smaller than mine (which has about 1 million rows). I'm getting another error, HTTPError: HTTP Error 429: Too Many Requests, caused by m1 = df['user_message'].astype(str).apply(lambda x: TextBlob(x).detect_language() if len(x) >= 3 else '').eq('en') . – Artengo Polienko Jan 23 '21 at 22:45
  • Which line raises the error? Are you making requests? – ansev Jan 23 '21 at 22:47
  • It is the line m1 = df['user_message'].astype(str).apply(lambda x: TextBlob(x).detect_language() if len(x) >= 3 else '').eq('en'). Perhaps it has to check with the server for each message to see whether it is in English? – Artengo Polienko Jan 23 '21 at 22:48
  • I just checked this thread and apparently it blocks when too many requests are sent: https://stackoverflow.com/questions/56189054/textblob-httperror-http-error-429-too-many-requests. Is it possible to use your solution with GoogleTranslator (https://py-googletrans.readthedocs.io/en/latest/), as suggested in one of the answers? – Artengo Polienko Jan 23 '21 at 22:54