
I have two pandas dataframes: one with premium customers, df_premium_customer, and one with all sold items, df_sold, which has the columns "CustomerID" (containing the IDs of premium customers as well as others), "ArticleID", "Date", and several others.

This is how df_premium_customer looks

  Premium_CustomerID
  34674324
  18634345
  99744336

and this is how df_sold looks

  CustomerID    ArticleID    Date
  34674324      3467434      20140302
  98674342      3454234      20140822
  74644334      4444434      20150321
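
The sample data above can be reproduced with (just the columns shown; the real df_sold has several more):

import pandas as pd

df_premium_customer = pd.DataFrame({"Premium_CustomerID": [34674324, 18634345, 99744336]})

df_sold = pd.DataFrame({
    "CustomerID": [34674324, 98674342, 74644334],
    "ArticleID": [3467434, 3454234, 4444434],
    "Date": [20140302, 20140822, 20150321],
})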

For each premium customer I need to build a data structure (preliminarily I chose a dict) that shows what has been sold to them.
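
With the sample rows above, the result would look something like this: one list of [ArticleID, Date] pairs per premium customer, empty if they bought nothing:

sold_to_customer = {
    34674324: [[3467434, 20140302]],
    18634345: [],
    99744336: [],
}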

So far I'm using the following Python 3 code:

sold_to_customer = {}
for customer in df_premium_customer["CustomerID"]:
    # generate the list of indexes at which this customer appears in df_sold
    cust_index = df_sold.index[df_sold['CustomerID'] == customer].tolist()

    # add this customer as a key to the dict
    sold_to_customer[customer] = []

    for ind in cust_index:
        # add what was bought and when as values for this key
        sold_to_customer[customer].append(list(df_sold.loc[ind, ["ArticleID", "Date"]]))

This is way too slow!

Letting it run for a bit and extrapolating, it would need about 16 hours to complete, since I have 300k premium customers and several million rows in the sold-items dataframe.

  • Can you add a data sample? – jezrael Oct 18 '18 at 07:51
  • @jezrael Is there an easier way than adding a screenshot (which I can't do right now)? I don't think stackoverflow allows the creation of tables...? – l7ll7 Oct 18 '18 at 07:52
  • Better is to copy the data as text; best is to use a [pandas example](http://stackoverflow.com/questions/20109391/how-to-make-good-reproducible-pandas-examples) – jezrael Oct 18 '18 at 07:53
  • Btw, maybe you need `merge` - `df_premium_customer.merge(df_sold, on="CustomerID")` – jezrael Oct 18 '18 at 07:54
  • @jezrael Until I figure out the pandas examples, I quickly inserted some HTML tables – l7ll7 Oct 18 '18 at 07:59
  • What is the expected output? Are empty lists necessary too? – jezrael Oct 18 '18 at 08:14
  • @jezrael Yes, empty lists would be necessary too (which can happen if a premium customer bought nothing). I couldn't figure out yet whether your approach takes that into account. – l7ll7 Oct 18 '18 at 08:47

1 Answer


I believe that your issue is coming from pandas. In general, row-by-row lookups in pandas are very slow. You might get some speedup by using the merge or groupby methods, but I'm not even sure. I believe one easy way to get a speedup is to do it all in numpy. I think the line

cust_index = df_sold.index[df_sold['CustomerID'] == customer].tolist()

costs you a lot, since you do it for every customer.

What you could do is create a dictionary with all the premium customer IDs, and then go through all your data once. To do that, you could use a for loop, which would still be slow, but I believe faster than what you did with pandas.

sold_to_customer = {}
for customer in df_premium_customer["CustomerID"]:
    # Initialize the dict with an empty list per premium customer
    sold_to_customer[customer] = []

# Restrict to the three needed columns so each row unpacks into (i, j, k)
data = df_sold[["CustomerID", "ArticleID", "Date"]].values
for i, j, k in data:
    sold_to_customer[i].append([j, k])

This makes you go through your data only once, and since dict access is fast, you should be good to go. Let me know if this speeds things up, and whether the speed is enough or it should still be optimized.
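
For comparison, the merge/groupby route mentioned at the top could look roughly like this (just a sketch, not benchmarked; it assumes the premium frame's column is called "Premium_CustomerID" as in the question's table, so it gets renamed first):

# Rename so both frames share a "CustomerID" column, then keep only premium sales.
premium = df_premium_customer.rename(columns={"Premium_CustomerID": "CustomerID"})
merged = premium.merge(df_sold[["CustomerID", "ArticleID", "Date"]], on="CustomerID", how="left")

# Collect the [ArticleID, Date] pairs per customer.
sold_to_customer = (
    merged.dropna(subset=["ArticleID"])
          .groupby("CustomerID")[["ArticleID", "Date"]]
          .apply(lambda g: g.values.tolist())
          .to_dict()
)

# Premium customers who bought nothing are missing above; give them empty lists.
for cust in premium["CustomerID"]:
    sold_to_customer.setdefault(cust, [])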

  • I checked out your approach too. Are you sure this is numpy related, though? At no point do I actually need any numpy methods. I think this approach also needs to be optimized more; already `for item in df_sold.values: print(item[1],item[2],item[3])` just computes and computes without outputting anything... :( – l7ll7 Oct 18 '18 at 09:08
  • I think the problem is with `df_sold.values`, since `df_sold` contains millions of entries – l7ll7 Oct 18 '18 at 09:09
  • Did you try to track the progress on this version to see how much time it would take? I'm pretty sure this is faster than your original method, since you used to go through your whole dataset for each customer, and now it will only do it once. – Statistic Dean Oct 18 '18 at 09:11
  • You can track the progress by importing tqdm and replacing data by tqdm(data) in the second for loop. – Statistic Dean Oct 18 '18 at 09:13
  • I'm afraid your approach has a flaw: there are customers in `data` that are not premium customers, as mentioned. So when executing `sold_to_customer[i].append([j,k])` and `i` happens to be a customer that is not premium, this key will not be present in `sold_to_customer`, so it will cause an error – l7ll7 Oct 21 '18 at 08:48
  • Even with that error, which I fixed, your approach is fastest when loading really massive datasets. This is actually strange to me, since I would have expected pandas' builtin `merge` function to be faster than picking the items from the dataset the way you did. Do you know why your approach is faster than merge? – l7ll7 Oct 21 '18 at 09:39
  • I can't really give you a full explanation, because I don't know it myself. I just know from experience that pandas is very slow if you're not careful. – Statistic Dean Oct 22 '18 at 08:05
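
For reference, the fix mentioned in the last two comments could look roughly like this (just a sketch; it restricts df_sold to the three needed columns, skips non-premium customers, and uses tqdm for progress as suggested above):

from tqdm import tqdm

# One empty list per premium customer (using "CustomerID" as in the code above).
sold_to_customer = {customer: [] for customer in df_premium_customer["CustomerID"]}

# Only the three columns we need, so each row unpacks cleanly.
data = df_sold[["CustomerID", "ArticleID", "Date"]].values
for cust, article, date in tqdm(data):
    # Skip sales to customers that are not premium customers.
    if cust in sold_to_customer:
        sold_to_customer[cust].append([article, date])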