
I'm trying to apply a solution proposed in response to a question I recently asked on Stack Overflow: (Drop rows based on specific conditions on strings).

Given the dataframe below, I'm trying to drop the rows whose user_message string has fewer than five words or is not written in English:

username  user_message
Polop     I love this picture, which is very beautiful
Artil     Meh
Artingo   Es un cuadro preciosa, me recuerda a mi infancia.
Zona      I like it
Soi       Yuck, to say I hate it would be a euphemism
Iyu       NaN

Here is the corresponding code:

import pandas as pd
from langdetect import detect

df = pd.read_csv("sample.csv")

# Keep only messages with at least five words
m1 = df.index[df['user_message'].str.split(' ').str.len() >= 5]
df_filtered = df.loc[m1]
del m1
del df

# Drop messages that are only whitespace
m2 = df_filtered['user_message'].str.isspace()
df_filtered = df_filtered.loc[~m2].reset_index(drop=True)
del m2

# Keep only messages detected as English
m3 = df_filtered['user_message'].astype(str).apply(lambda x: detect(x) if len(x) >= 5 else '').eq('en')
df_filtered = df_filtered.loc[m3].reset_index(drop=True)

The issue is that I have over 12 million rows. I've heard that a SQL database would be more efficient for this kind of operation, but I know virtually nothing about SQL. Is there a solution to speed up the process? I tried modin, but the results were inconclusive.

Thank you!

  • For SQL you need to create a database and insert all the data. – Epsi95 Jan 25 '21 at 10:49
  • Yes, but as I said, I'm really not familiar with SQL. Is there any pandas solution that could be faster than my script? – Artengo Polienko Jan 25 '21 at 10:50
  • You want to speed it up. Why? Speeding up things without need is called premature optimization. How long does it take with CSV? How large is the CSV file? Do you use SSD or HDD? How long / what time is acceptable? Given you need 2 days to set up a SQL server, maybe the script will do it in that time. How often do you need to do that task? It looks like a one time action to me. How much RAM does your PC have? – Thomas Weller Jan 25 '21 at 10:55
  • The third line of the code snippet already shows that it's a CSV, so you didn't forget to mention anything. – Avatazjoe Jan 25 '21 at 10:55
  • Have you run the script? If you have already processed the data, then why do you need it again? What is the time that it is taking? – Epsi95 Jan 25 '21 at 10:56
  • @Epsi95 This is the thing. I don't know, since I let it run for 24 hours and it was still not finished (and my RAM (16 GB) was maxing out). I'm actually not worried about the time it might take, but rather about the fact that the script is not progressing (I've added print checks, and the first step (m1 = df.index[df['user_message'].str.split(' ').str.len() > 5]) was still not reached after 24 hours). – Artengo Polienko Jan 25 '21 at 11:00
  • If you want to work with data, you *have* to learn to use SQL and databases. 12M rows isn't a lot of data. Setting up a database server shouldn't take 2 hours, much less 2 days. The indexes used by a database can accelerate queries by orders of magnitude. On the other hand, bad queries will be slow no matter what you do. Trying to split 12M rows and check the contents *each time* will result in big delays *and* prevent you from using indexes on the fields. Loading 12M rows only to filter them row by row will end up wasting all the time loading the data instead of filtering it. – Panagiotis Kanavos Jan 25 '21 at 11:03
  • `detect` may cost lots of time; try another solution, like a regexp checking whether the text contains `[a-zA-Z]`. – Ferris Jan 25 '21 at 11:06
  • You'd get better performance if you read the file as a stream, processing each row as it arrives and, at the very least, extracting the name parts. This way you'd produce results at the same time you load the data. Even if you use a database, you'll have to split the data if you want to be able to use indexes to accelerate queries. – Panagiotis Kanavos Jan 25 '21 at 11:06
  • If you use a pipeline of threads to read and process the data, you can have one thread read the file, another split/parse the rows, another filter them, etc. – Panagiotis Kanavos Jan 25 '21 at 11:07
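The suggestions above (a cheap regex pre-filter, streaming the file in chunks instead of loading it whole) could be combined roughly as follows. This is only a sketch: the column names match the question, but the in-memory sample data and the chunk size are illustrative assumptions (in practice you would pass the real file path and a chunk size in the hundreds of thousands).

```python
import io
import pandas as pd

# Tiny in-memory stand-in for the real 12M-row sample.csv (assumption).
csv_data = io.StringIO(
    "username,user_message\n"
    'Polop,"I love this picture, which is very beautiful"\n'
    "Artil,Meh\n"
    "Zona,I like it\n"
)

kept = []
# chunksize streams the file piece by piece instead of loading all rows
# at once, so memory use stays bounded.
for chunk in pd.read_csv(csv_data, chunksize=2):
    msgs = chunk["user_message"].astype(str)
    # Cheap vectorized checks: at least five words and some Latin letters.
    mask = (msgs.str.split().str.len() >= 5) & msgs.str.contains(r"[a-zA-Z]")
    kept.append(chunk[mask])
    # The expensive langdetect call would then run on chunk[mask] only.

df_out = pd.concat(kept, ignore_index=True)
```

Because each chunk is filtered before the next one is read, the slow language check only ever sees rows that already passed the cheap vectorized tests.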

1 Answer


Can you try this? The `and` will short-circuit, so both conditions are checked in a single iteration.

import pandas as pd
from langdetect import detect  # pip install langdetect

df = pd.read_csv("sample.csv")

def cusom_detect(x):
    try:
        return detect(x) == 'en'
    except Exception:
        # detect raises e.g. "No features in text" on empty/odd strings
        return False

# and short-circuits: detect only runs when the word count passes
df_out = df[df['user_message'].apply(lambda x: (len(x.split(' ')) >= 5) and cusom_detect(x))]
df_out.to_csv('output.csv')
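The same idea can be pushed further by computing the word-count mask with vectorized string methods first, so the slow per-row detector is only applied to the rows that survive. The sketch below uses a hypothetical `fake_detect` stand-in (a crude English-stopword check) in place of `langdetect.detect`, purely so the example is self-contained; the data is the sample from the question.

```python
import pandas as pd

# Hypothetical stand-in for langdetect.detect: flags text containing
# a few common English function words. Only for this sketch.
def fake_detect(text):
    words = set(text.lower().split())
    return "en" if words & {"i", "the", "it", "is"} else "other"

df = pd.DataFrame({
    "username": ["Polop", "Artil", "Artingo"],
    "user_message": [
        "I love this picture, which is very beautiful",
        "Meh",
        "Es un cuadro preciosa, me recuerda a mi infancia.",
    ],
})

# Vectorized word count first: no Python-level function call per row.
mask = df["user_message"].astype(str).str.split().str.len() >= 5
df_out = df[mask]
# The per-row (slow) language check now touches far fewer rows.
df_out = df_out[df_out["user_message"].apply(lambda x: fake_detect(x) == "en")]
```

With the real `detect`, the vectorized pre-filter typically removes a large share of the 12M rows before any per-row Python call is made.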

Using pandarallel (https://github.com/nalepae/pandarallel):

import pandas as pd
from langdetect import detect  # pip install langdetect
from pandarallel import pandarallel  # pip install pandarallel

pandarallel.initialize()

df = pd.read_csv("sample.csv")

def cusom_detect(x):
    try:
        return detect(x) == 'en'
    except Exception:
        return False

# parallel_apply spreads the work across all available cores
df_out = df[df['user_message'].parallel_apply(lambda x: (len(x.split(' ')) >= 5) and cusom_detect(x))]
df_out.to_csv('output.csv')
Epsi95
  • Thank you for your answer. The only issue is that detect might throw an error if the string it processes is empty (which is why I ran m2 = df_filtered['user_message'].str.isspace() before trying to detect the language). – Artengo Polienko Jan 25 '21 at 11:25
  • I just did the df_filtered['user_message'].str.isspace() first, which was quite fast. I'm going to try your solution. – Artengo Polienko Jan 25 '21 at 11:26
  • you don't need to. `(len(x.split(' ')) >= 5)` is checked first. Since that condition is False for whitespace-only strings, `cusom_detect(x)` is never evaluated. – Epsi95 Jan 25 '21 at 11:28
  • Sorry, I think you already made sure of that with (len(x.split(' ')) >= 5) – Artengo Polienko Jan 25 '21 at 11:28
  • I'm trying your answer right now. At least for now, my RAM is not bloated like it used to be with my script aha! I'll let you know if it worked, but thank you a lot for helping me. – Artengo Polienko Jan 25 '21 at 11:35
  • After 2 hours, detect threw me the error "No features in text", which is strange since all empty strings were removed. – Artengo Polienko Jan 25 '21 at 14:24
  • Oh, so sad. What do you intend to do in that case? Ignore them? Let me wrap it in a try/except. – Epsi95 Jan 25 '21 at 14:28
  • Just drop them as well would be the best thing to do. Thank you! – Artengo Polienko Jan 25 '21 at 14:32
  • Just found a package, `pandarallel`. Let me incorporate that. You can check whether it gives any speed improvement. – Epsi95 Jan 25 '21 at 14:38
  • I didn't have Linux at the time, so I can't say anything about pandarallel, but your code worked (took ~16 h). Thank you very much! – Artengo Polienko Feb 04 '21 at 13:57