When I try to merge two dataframes, I run out of memory. So I decided to look for a way to do the calculations separately and then append the results. I came across this answer.

However, I don't understand it well enough to fit my own functions into its structure. The details of what I am trying to do are given below.

I am trying to calculate a daily logit value from consumer purchase history, to understand the likelihood of a consumer making a purchase given all the offers they received. The original table looks like the following:

    a_id b_received brand_id c_consumed type_received       date  
0    sam       soap     bill        oil       edibles 2011-01-01     
1    sam        oil    chris        NaN       utility 2011-01-02          
2    sam      brush      dan       soap       grocery 2011-01-03          
3  harry        oil      sam      shoes      clothing 2011-01-04          
4  harry      shoes     bill        oil       edibles 2011-01-05          
5  alice       beer      sam       eggs     breakfast 2011-01-06          
6  alice      brush    chris      brush      cleaning 2011-01-07          
7  alice       eggs      NaN        NaN       edibles 2011-01-08          

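To make the question reproducible, here is a minimal sketch that rebuilds the sample table above (the construction code is my own; only the values come from the table):

    import numpy as np
    import pandas as pd

    # Reconstruction of the sample table shown above
    df = pd.DataFrame({
        'a_id': ['sam', 'sam', 'sam', 'harry', 'harry', 'alice', 'alice', 'alice'],
        'b_received': ['soap', 'oil', 'brush', 'oil', 'shoes', 'beer', 'brush', 'eggs'],
        'brand_id': ['bill', 'chris', 'dan', 'sam', 'bill', 'sam', 'chris', np.nan],
        'c_consumed': ['oil', np.nan, 'soap', 'shoes', 'oil', 'eggs', 'brush', np.nan],
        'type_received': ['edibles', 'utility', 'grocery', 'clothing', 'edibles',
                          'breakfast', 'cleaning', 'edibles'],
        'date': pd.to_datetime(['2011-01-01', '2011-01-02', '2011-01-03', '2011-01-04',
                                '2011-01-05', '2011-01-06', '2011-01-07', '2011-01-08']),
    })
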
My final goal is to get a table like this, which will have only 90 rows (one for each of 90 days):

            date  logit_value
   0  2011-01-01       0.42   
   1  2011-01-02       0.34   
   2  2011-01-03       0.24   
   3  2011-01-04       0.13   
   4  2011-01-05       0.49   
   5  2011-01-06       0.54   
   6  2011-01-07       0.66   
   7  2011-01-08       0.71   

I am now lost as to how to use the functions that worked on the smaller dataset inside this batch structure to create the desired final dataset.

The code that I used for the smaller dataset, with an explanation of what it does, is given below.

In order to calculate the logit, I first need to compute an output value (0 or 1) indicating whether a user consumed a product they received. I used the following code to get that output:

# For each user, check whether each received product appears anywhere
# in that user's consumed products: 1 if consumed, 0 otherwise
df['output'] = (df.groupby('a_id')
                  .apply(lambda x: x['b_received'].isin(x['c_consumed']).astype('i4'))
                  .reset_index(level='a_id', drop=True))

And the new table looks like this

        a_id b_received brand_id c_consumed type_received       date  output  
   0    sam       soap     bill        oil       edibles 2011-01-01       1   
   1    sam        oil    chris        NaN       utility 2011-01-02       1   
   2    sam      brush      dan       soap       grocery 2011-01-03       0   
   3  harry        oil      sam      shoes      clothing 2011-01-04       1   
   4  harry      shoes     bill        oil       edibles 2011-01-05       1   
   5  alice       beer      sam       eggs     breakfast 2011-01-06       0   
   6  alice      brush    chris      brush      cleaning 2011-01-07       1   
   7  alice       eggs      NaN        NaN       edibles 2011-01-08       1   

Here 1 means the product was consumed by that particular user, and 0 means it wasn't.
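
To make the grouping concrete: sam received soap, oil, and brush across his rows, and consumed oil and soap, so soap and oil get 1 and brush gets 0. The same check can be run on one user in isolation (illustration only, using the sample frame built above):

    # The per-group logic, shown for user 'sam' alone
    sam = df[df['a_id'] == 'sam']
    print(sam['b_received'].isin(sam['c_consumed']).astype('i4'))
    # 0    1    soap was consumed in one of sam's rows
    # 1    1    oil was consumed
    # 2    0    brush was never consumed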

After this I need to group by date, apply a function to the output column, and store the result in a new dataframe. For each day, the sum of the output values is divided by the number of rows to get a probability, which is then converted into a logit value. This is the code I plan to use:

import numpy as np

def logit(x):
    # Daily probability: fraction of received offers that were consumed
    prob = x.mean()
    # Convert the probability to log-odds
    return np.log(prob / (1 - prob))

final_df = (df.groupby('date')['output']
              .apply(logit)
              .reset_index(name='daily_logit'))

which should give me something like this

   date             logit_value
0  2011-01-01       0.42   
1  2011-01-02       0.34   
2  2011-01-03       0.24   
3  2011-01-04       0.13   
4  2011-01-05       0.49   
5  2011-01-06       0.54   
6  2011-01-07       0.66   
7  2011-01-08       0.71   
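
One caveat I should note (my own observation, not part of the code above): on a day where every offer is consumed (prob = 1) or none are (prob = 0), the log-odds are infinite. If that can happen in a batch, a minimal fix is to clip the probability, something like:

    import numpy as np

    def safe_logit(x, eps=1e-6):
        # Keep the daily consumption rate away from exactly 0 and 1
        # so the log-odds stay finite on all-or-nothing days
        prob = np.clip(x.mean(), eps, 1 - eps)
        return np.log(prob / (1 - prob))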

But all of this is contingent on being able to divide the operations into small batches, as I mentioned earlier.

Below is an example of code from another Stack Overflow user that I am trying to adapt, so far without success. The link for this code is here.

    import pandas as pd

    # Create input tables
    t1 = {'scenario': [0, 0, 1, 1],
          'letter': ['a', 'b'] * 2,
          'number1': [10, 50, 20, 30]}

    t2 = {'letter': ['a', 'a', 'b', 'b'],
          'number2': [2, 5, 4, 7]}

    table1 = pd.DataFrame(t1)
    table2 = pd.DataFrame(t2)
    table3 = pd.DataFrame()

    # Merge one scenario at a time so only a small piece of the data
    # is in memory at once, then append each partial result
    grouped = table1.groupby('scenario')

    for _, group in grouped:
        temp = pd.merge(group, table2, on='letter')
        temp['calc'] = temp['number1'] * temp['number2']
        table3 = table3.append(temp.loc[temp.groupby('letter')['calc'].idxmax()])
        del temp

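Here is how I imagine that loop-and-append pattern could map onto my problem: a rough sketch (untested on the full data; the name `daily` and the one-user-per-batch grouping are just for illustration). Since each user's full history must stay together for the isin check, I batch by user; and since the per-day sum and count of output are additive across batches, the logit only needs to be computed once at the end:

    import numpy as np
    import pandas as pd

    daily = None

    # Process one user's history per iteration; in practice each batch
    # could hold many users, as long as no user is split across batches
    for _, batch in df.groupby('a_id'):
        out = batch['b_received'].isin(batch['c_consumed']).astype('i4')
        agg = batch.assign(output=out).groupby('date')['output'].agg(['sum', 'count'])
        # Per-day sums and counts are additive, so partial results combine safely
        daily = agg if daily is None else daily.add(agg, fill_value=0)

    prob = daily['sum'] / daily['count']
    final_df = np.log(prob / (1 - prob)).reset_index(name='daily_logit')

If this is a sound way to apply the batching idea to my computation, that is essentially what I am trying to confirm.
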
I realize this is a long question, but I have been stuck on it for a very long time. Any help is appreciated. Please ask if anything in the question is unclear.

    If you look at the info for [batch-file] tag you'll see it has nothing to do with python or databases (or even pandas). –  Feb 20 '16 at 22:00
  • sorry about that!! It was a suggestion that came up automatically – Amit Singh Parihar Feb 20 '16 at 22:09
  • I personally got lost at the end. It made sense until "But all of this". Up to this point, you don't seem to use `merge`. After that, you seem to be grouping on some unknown `scenario`, and doing all sorts of stuff. I couldn't connect the part following this to the part preceding it. – Ami Tavory Feb 20 '16 at 22:53
  • Sorry about that, the last code is something I found in stackoverflow where the person was dealing with this problem and used it for his problem. I am trying to emulate that code, but haven't been successful so far. I have edited the code for more clarity – Amit Singh Parihar Feb 20 '16 at 23:03

0 Answers