Time complexity vs memory usage in pandas using large datasets

Question

I am wondering which would be the most efficient way of creating a column in a pandas dataframe which if an id_row exist in a given list will return 1 or else 0.

I'm currently using a lambda function to apply the result. My problem is that it is taking a long time as my dataframe is around 2M rows and the list it checks into between 200k to 100k items. If I'm not mistaken, this is quadratic time (I'm really not sure though), which in this case runs really slowly give the size of the objects.

The worst is that I have to repeat this bit of code for over a 100 other (different) dataframes.

Here is the function:

 lst_to_add = [1,2,3.......,n] 
 df_table['TEST'] = df_table['id_row'].apply(lambda x : 1  if x i lst_to_add else 0)

I'm wondering how could I make the bit of code (way) more efficient. I thought of a 'divide and conquer' solution using a recursive function perhaps, but I'm really open to any suggestions.

Last thing. I also have constraits with memory hence I'd would prefer a method which takes a little more time but less memory than the alternative (if I had a choice).

score 0 · Answer 1 · answered Aug 06 '19 at 18:53

0

You could do

df_table['TEST'] = (df_table['id_row'].isin(lst_to_add)).astype(int)

This code checks if the id_row variables are in lst_to_add and returns True and False, which the astype(int) converts to 1's and 0's. Since this approach is vectorized (acts on the entire series), it should be faster than using apply.

answered Aug 06 '19 at 18:53

m13op22

2,168
2
16
35

1

thanks for your answer. I have compared the results on a small set (10 000 rows) and the difference is already huge : [lambda : len lst : 4164 len df :10000 AddResultToTDB : 513.489 ms] [isin : len lst : 4164 len df :10000 AddResultToTDB : 5.296 ms] . I'll give the comparison with my real dataset tomorow. – ZaraThoustra Aug 06 '19 at 19:07

score 0 · Answer 2 · answered Aug 06 '19 at 18:56

0

As far as time complexity, your list should be a set, this will make your O(M*N) solution O(N), since set membership tests are constant time instead of linear time (like it is for lists). Then, use the built-in method .isin:

lst_to_add = set(lst_to_add)
df_table['TEST'] = df_table['TEST'].isin(lst_to_add)

You should stick to the boolean type if memory is an issue, and you only want 0 and 1.

answered Aug 06 '19 at 18:56

juanpa.arrivillaga

88,713
10
131
172

this is great ! Thanks for the tip. here are the results now with the set() added to the above solution : [len lst : 4164 len df :10000 AddResultToTDB : 0.000 ms] – ZaraThoustra Aug 06 '19 at 19:14
here is a bit of documentation on the difference between sets and lists. I was only using set() to remove all duplicates, but I know understand that the search in the set() method uses a hash function, explining such a difference in performance https://stackoverflow.com/questions/12354515/what-is-the-difference-between-sets-and-lists-in-python – ZaraThoustra Aug 06 '19 at 19:20
@BorisM well yes, `set` objects are commonly used to get a set of unique values from some iterable, but that isn't their primary use-case. – juanpa.arrivillaga Aug 08 '19 at 18:38

Time complexity vs memory usage in pandas using large datasets

2 Answers2