
I have 600 CSV files, each containing around 1500 rows of data. I have to run a function on every row, and I have defined the function:

def query_prepare(data):
    """Function body goes here.

    Here the input `data` is a list holding a single row of the dataframe.
    """

The above function performs operations like strip() and replace() based on conditions. It takes every single row of data as a list, e.g.:

data = ['apple$*7', 'orange ', 'bananna', '-']

This is what my initial dataframe looks like:

        a            b        c          d
  0     apple$*7     orange   bananna    -
  1     apple()*7    flower]  *bananna   -

I timed the function on one row of data; processing takes around 0.04 s, so running it on one CSV file containing 1500 rows takes almost 1500 * 0.04 s. I have tried some of these methods:

# normal built-in apply function
import time

t = time.time()
a = df.apply(lambda x: query_prepare(x.to_list()), axis=1)
print('time taken', time.time() - t)

# time taken 52.519816637039185



# with swifter
import swifter

t = time.time()
a = df.swifter.allow_dask_on_strings().apply(lambda x: query_prepare(x.to_list()), axis=1)
print('time taken', time.time() - t)

# time taken 160.31028127670288


# with pandarallel
from pandarallel import pandarallel

pandarallel.initialize()
t = time.time()
a = df.parallel_apply(lambda x: query_prepare(x.to_list()), axis=1)
print('time taken', time.time() - t)

# time taken 55.000578

I have done everything I can inside my query_prepare function to reduce the time, so there is no way to change or modify it further. Any other suggestions?

P.S. By the way, I'm running this on Google Colab.

EDIT: If we have 1500 rows of data, can we split them into, say, 15 chunks and apply the function to each chunk in parallel? Could we cut the time by a factor of 15 by doing something like this? (I'm sorry, I'm not sure whether this is possible or not; please point me in the right direction.)
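Roughly, what I have in mind is something like the sketch below (`process_chunk` is just a helper name I made up for illustration; as I understand it, the speedup is limited by the number of CPU cores available, not by the number of chunks):

import os
from multiprocessing import Pool

import numpy as np
import pandas as pd

def process_chunk(chunk: pd.DataFrame) -> pd.Series:
    # run the existing row-wise function on one chunk of the dataframe
    return chunk.apply(lambda x: query_prepare(x.to_list()), axis=1)

n_workers = os.cpu_count()              # Colab usually exposes only 2 cores
chunks = np.array_split(df, n_workers)  # more chunks than cores won't help
with Pool(n_workers) as pool:
    parts = pool.map(process_chunk, chunks)
a = pd.concat(parts)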

  • What does your data look like? What is the slowest part in query_prepare? The operations, and how the data looks, have quite a bit of effect on which methods you can use to make your code run faster. – Niko Föhr Nov 11 '20 at 06:31
  • It's good that you've reduced all the clutter of your code, but please include enough code and data so that people can actually run your code and help you. – Niko Föhr Nov 11 '20 at 06:34
  • @np8 I guess I can't reduce the time within the `query_prepare` function. Is there any way to process pandas rows faster, or to process multiple rows at a time? – dark knight Nov 11 '20 at 06:45
  • Could you paste first 10-20 rows of your data? Or dummy data that looks like it. And add code to query_prepare that can be used for benchmarking? – Niko Föhr Nov 11 '20 at 06:52
  • your function should not take each row as a list, this makes things slow. I assume you just have a handful of columns. If so, you should rather create a function that takes a column (as a Series not as a list) and use some vectorized string operations. This should speed up things already tremendously – Quickbeam2k1 Nov 11 '20 at 07:00
  • @np8 I have added the dataframe. Basically the `query_prepare` function formats the text in every row – dark knight Nov 11 '20 at 07:01
  • @Quickbeam2k1 can you give me an example or any useful links? – dark knight Nov 11 '20 at 07:02
  • You could also take a look at the `vaex` library. [`vaex.dataframe.DataFrame.apply`](https://vaex.readthedocs.io/en/latest/api.html#vaex.dataframe.DataFrame.apply). – CypherX Nov 11 '20 at 07:06
  • To me it seems your code is heavily "row-iteration-based", and that you should think about how to code everything without using `.apply()`, see [this answer](https://stackoverflow.com/a/54432584/565489) for a good explanation on `.apply()`'s generally poor performance. – Asmus Nov 11 '20 at 07:12
  • @Asmus that's a good read. What about the last part of my question, is it possible? – dark knight Nov 11 '20 at 07:36
  • You mean the part about "splitting 1500 rows into 15"? If so, then I must admit I didn't understand that part at all. Is it about parallelising code? Overall, it appears to me you may be trying to optimise the "wrong parts" of your code, but to answer this, you'll need to _provide us with something to work with_. As long as we don't know what either your `data` or your `query_prepare()` look like, it's hard to tell what's going on. Additionally, I'm not sure whether your attempt of parallelising the code makes sense here. Perhaps you could add more information on functions and data? – Asmus Nov 11 '20 at 07:46
  • No, what I asked was: if 10 rows take `10s`, then a df with 100 rows would take `1000s` if we iterate through every row and apply the function to each one. What if we divide the 100 rows into 10 sets? We still iterate through each row and apply the function, but as 10 sets, so one set takes `100s`. If we run the 10 sets at the same time with multiprocessing, would it all take only around `100s`? – dark knight Nov 11 '20 at 08:22
  • With multiprocessing, you have to (1) prepare your dataframe so it can actually be accessed in parallel and (2) have resources to parallelise on. How many CPU cores you have access to in your Google colab defines (theoretically!) your speedup. Think about it like this: if python is 100% efficient on _one core_, and you efficiently split the task up onto _4 cores_, your speedup is 4x since you now have 4x computing power. Unless you have 10 or more cores available, splitting one job into 10 does not make much sense. Please let us know what your data & function actually look like. – Asmus Nov 11 '20 at 10:18
  • honestly, with 1500 rows, some basic string processing, 4 columns (maybe 40), I wouldn't expect this to take longer than 1 second. Please show exactly what you want to do: the first few rows of data (if your sample does not contain the actual data) and, most importantly, your query_prepare function. Also, converting the row to a list will slow things down tremendously. I also assume that instead of pandas str functions you're using Python's built-ins, which might be slow as well – Quickbeam2k1 Nov 12 '20 at 07:39

1 Answer


For example, you could roughly do the following:

import pandas as pd

def sanitize_column(s: pd.Series):
    return s.str.strip().str.strip('1234567890()*[]')

then you could do:

df.apply(sanitize_column, axis=0)

with:

df = pd.DataFrame({'a': ['apple7', 'apple()*7'], 'b': ["    asd   ", ']asds89']})

this will give

       a     b
0  apple   asd
1  apple  asds

This should be faster than your solution. For proper benchmarking, we'd need your full solution.
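As a rough sketch of how this might extend to the full set of files (the glob pattern below is only a placeholder for wherever the 600 CSVs live):

import glob

import pandas as pd

def sanitize_column(s: pd.Series):
    # same vectorised per-column cleaning as above
    return s.str.strip().str.strip('1234567890()*[]')

cleaned = []
for path in glob.glob('csv_files/*.csv'):  # placeholder path, adjust to your data
    df = pd.read_csv(path)
    # apply the cleaning column-wise instead of row-wise
    cleaned.append(df.apply(sanitize_column, axis=0))

result = pd.concat(cleaned, ignore_index=True)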

Quickbeam2k1