I am attempting to generate more variables for my dataset. The data is stored in multiple files, and because of memory limitations I can only read one file at a time with pandas. Each csv file holds the data for a single month and looks something like this:
Index  Date      Sender  Recipient  Quantity  Type
--------------------------------------------------
79XT   26-03-19  Adam    Tiffany    72        Box
57ZY   14-03-19  Josh    Ross       13        Snack
29UQ   19-03-19  Adam    Alex       60        Fruit
56PY   06-03-19  Lucy    Alex       29        Book
41BR   28-03-19  Josh    Steve      33        Snack
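At the moment I load each month separately, keeping only the columns I need and using categorical dtypes to save memory. A sketch of what I do (the `data/` directory and file layout are just placeholders for my setup):

```python
from pathlib import Path

import pandas as pd

USECOLS = ["Index", "Date", "Sender", "Recipient", "Quantity", "Type"]
DTYPES = {"Sender": "category", "Recipient": "category", "Type": "category"}


def load_month(path):
    """Read one monthly csv, keeping only the needed columns and using
    categorical dtypes to cut memory for the repeated string columns."""
    return pd.read_csv(
        path,
        usecols=USECOLS,
        dtype=DTYPES,
        parse_dates=["Date"],
        dayfirst=True,  # dates look like 26-03-19, i.e. day first
    )


# Hypothetical layout: one csv per month in a data/ directory.
# frames = [load_month(p) for p in sorted(Path("data").glob("*.csv"))]
```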
Now I am trying to generate more features for each row based on each sender's history, and to join these features to the dataframe. For example:
Index  Date      Sender  Recipient  Quantity  Type   Days Since         Days Since      Cumulative  Quantity Increase       First Shipment
                                                     Previous Shipment  First Shipment  Quantity    from Previous Shipment  to This Recipient?
----------------------------------------------------------------------------------------------------------------------------------------------
79XT   26-03-19  Adam    Tiffany    72        Box    7                  62              1792        12                      0
57ZY   14-03-19  Josh    Ross       13        Snack  NaN                NaN             13          NaN                     1
29UQ   19-03-19  Adam    Alex       60        Fruit  5                  55              1730        -7                      1
56PY   06-03-19  Lucy    Alex       29        Book   23                 32              88          -4                      0
41BR   28-03-19  Josh    Steve      33        Snack  14                 14              46          20                      1
As you can see from the desired dataframe above, the new variables are generated from each sender's previous observations. What is the least computationally expensive way of generating such features? I need information from all of the monthly csv files to compute them. There are over 200,000 unique senders, so reading the csv files, producing a dataframe and a csv file for every unique sender, and merging that data back into the monthly files would take weeks. I am aware of dask and dask.distributed, but I would like to know whether there is a simpler way to implement this.
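For what it's worth, if everything did fit in memory at once, I believe the features themselves could be computed with vectorised groupby operations rather than a loop over 200,000 senders. A sketch of what I mean (column names taken from my example above; I have not tested this at full scale):

```python
import pandas as pd


def add_history_features(df):
    """Derive per-sender history features with vectorised groupby
    operations instead of one pass per sender."""
    df = df.sort_values(["Sender", "Date"]).copy()
    by_sender = df.groupby("Sender", observed=True)

    # Days since the sender's previous / first shipment.
    df["DaysSincePrev"] = by_sender["Date"].diff().dt.days
    df["DaysSinceFirst"] = (df["Date"] - by_sender["Date"].transform("min")).dt.days

    # Running total of quantity, and change from the previous shipment.
    df["CumQuantity"] = by_sender["Quantity"].cumsum()
    df["QuantityChange"] = by_sender["Quantity"].diff()

    # 1 on the first shipment from this sender to this recipient, else 0.
    df["FirstToRecipient"] = (
        df.groupby(["Sender", "Recipient"], observed=True).cumcount().eq(0).astype(int)
    )
    return df
```

My question is really about how to do this across many files that cannot all be loaded together, without writing one intermediate file per sender.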