I'm trying to use a proposed solution in response to a question I recently asked on StackOverflow: (Drop rows based on specific conditions on strings).
Given the dataframe below, I'm trying to drop only rows for which the user_message string has less than five words or that are not written in English:
username | user_message |
---|---|
Polop | I love this picture, which is very beautiful |
Artil | Meh |
Artingo | Es un cuadro preciosa, me recuerda a mi infancia. |
Zona | I like it |
Soi | Yuck, to say I hate it would be a euphemism |
Iyu | NaN |
Here's below the corresponding code:
import pandas as pd
from langdetect import detect
df = pd.read_csv("sample.csv")
m1 = df.index[df['user_message'].str.split(' ').str.len() > 5]
df_filtered = df.loc[m1]
del m1
del df
m2 = df_filtered['user_message'].str.isspace()
df_filtered = df_filtered.loc[ m2 == False].reset_index(drop=True)
del m2
m3 = df_filtered['user_message'].astype(str).apply(lambda x: detect(x) if len(x) >= 5 else '').eq('en')
df_filtered = df_filtered.loc[m3].reset_index(drop=True)
The issue is that I have over 12 million rows. I heard that using a SQL Database would be more efficient for this type of operation, but I know virtually nothing on SQL. Does it exist a solution to speed up the process? I tried modin, but it was not conclusive.
Thank you!