When trying to merge two dataframes, I run out of memory. So I decided to look for a way to do the calculations separately and then append the results. I came across this answer.
However, I don't understand it well enough to apply my own functions within its structure. The details of what I am trying to do are given below.
I am trying to calculate a daily logit value from consumer purchase history, to understand the likelihood of a consumer making a purchase given all the offers. The original table looks like the following:
a_id b_received brand_id c_consumed type_received date
0 sam soap bill oil edibles 2011-01-01
1 sam oil chris NaN utility 2011-01-02
2 sam brush dan soap grocery 2011-01-03
3 harry oil sam shoes clothing 2011-01-04
4 harry shoes bill oil edibles 2011-01-05
5 alice beer sam eggs breakfast 2011-01-06
6 alice brush chris brush cleaning 2011-01-07
7 alice eggs NaN NaN edibles 2011-01-08
My final goal is to get a table like this, which will have only 90 rows (one per day, for 90 days):
date logit_value
0 2011-01-01 0.42
1 2011-01-02 0.34
2 2011-01-03 0.24
3 2011-01-04 0.13
4 2011-01-05 0.49
5 2011-01-06 0.54
6 2011-01-07 0.66
7 2011-01-08 0.71
I am now lost as to how to plug the functions I used on the smaller dataset into this batch structure to create the desired final dataset.
The code that I used for the smaller dataset, with an explanation of what it does, is given below.
In order to calculate the logit I first need to compute an output value (0 or 1) indicating whether a user consumed a product they received. I used the following code to get that output.
# For each user, flag 1 if the received product appears anywhere in
# that user's consumed products, else 0
df['output'] = (df.groupby('a_id')
                  .apply(lambda x: x['b_received'].isin(x['c_consumed']).astype('i4'))
                  .reset_index(level='a_id', drop=True))
And the new table looks like this
a_id b_received brand_id c_consumed type_received date output
0 sam soap bill oil edibles 2011-01-01 1
1 sam oil chris NaN utility 2011-01-02 1
2 sam brush dan soap grocery 2011-01-03 0
3 harry oil sam shoes clothing 2011-01-04 1
4 harry shoes bill oil edibles 2011-01-05 1
5 alice beer sam eggs breakfast 2011-01-06 0
6 alice brush chris brush cleaning 2011-01-07 1
7 alice eggs NaN NaN edibles 2011-01-08 1
Here 1 means the product was consumed by that particular user, and 0 means it wasn't. Note that isin compares each b_received value against everything the same user consumed on any date, which is why row 0 gets a 1 even though soap was received and oil was consumed on that particular row.
After this I need to group by date, apply a function to the output column, and store the result in a new dataframe. For each day, the sum of the output values is divided by the number of rows to get a probability, which is then turned into a logit value. The code I plan to use is given below.
import numpy as np

def logit(x):
    # Fraction of received offers that were consumed on the day
    prob = x.sum() / len(x)
    return np.log(prob / (1 - prob))

final_df = (sample_dataframe.groupby('date')['output']
                            .apply(logit)
                            .rename('logit_value')
                            .reset_index())
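To make the function concrete: if on a given day 3 out of 4 received offers were consumed, then prob = 3/4 = 0.75 and the logit is log(0.75 / 0.25) = log(3) ≈ 1.10.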
which should give me something like this
date logit_value
0 2011-01-01 0.42
1 2011-01-02 0.34
2 2011-01-03 0.24
3 2011-01-04 0.13
4 2011-01-05 0.49
5 2011-01-06 0.54
6 2011-01-07 0.66
7 2011-01-08 0.71
But all of this is contingent on being able to split the operations into small batches, as I mentioned earlier.
Below is an example of code from another Stack Overflow user that I am trying to adapt, but so far without success. The link to this code is here.
import pandas as pd

# Create input tables
t1 = {'scenario': [0, 0, 1, 1],
      'letter': ['a', 'b'] * 2,
      'number1': [10, 50, 20, 30]}
t2 = {'letter': ['a', 'a', 'b', 'b'],
      'number2': [2, 5, 4, 7]}
table1 = pd.DataFrame(t1)
table2 = pd.DataFrame(t2)

# Merge one group at a time so the full merged table never sits in
# memory, keeping only the row with the largest 'calc' per letter
pieces = []
for _, group in table1.groupby('scenario'):
    temp = pd.merge(group, table2, on='letter')
    temp['calc'] = temp['number1'] * temp['number2']
    pieces.append(temp.loc[temp.groupby('letter')['calc'].idxmax()])
    del temp
table3 = pd.concat(pieces, ignore_index=True)
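For reference, here is roughly how I imagine adapting that pattern to my problem: process one batch of users at a time, accumulate per-date sums and counts of the output column, and only compute the logit at the end. This is just my rough sketch (the names daily and rows are my own), and I am not sure it is the right way to structure it:
import numpy as np
import pandas as pd

# Accumulate per-date counts batch by batch, so the full per-row
# output column never has to be held in memory at once
daily = {}  # date -> (consumed_count, total_count)

for _, user_group in df.groupby('a_id'):
    # Same per-user output computation as above, one user at a time
    output = user_group['b_received'].isin(user_group['c_consumed']).astype('i4')
    for date, val in zip(user_group['date'], output):
        consumed, total = daily.get(date, (0, 0))
        daily[date] = (consumed + val, total + 1)

# Convert the accumulated counts into one logit value per day
rows = []
for date in sorted(daily):
    consumed, total = daily[date]
    prob = consumed / total
    rows.append((date, np.log(prob / (1 - prob))))
final_df = pd.DataFrame(rows, columns=['date', 'logit_value'])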
I realize this is a long question, but I have been stuck on this for a very long time. Any help is appreciated. Please do ask if I haven't made anything clear in the question.