
I am attempting to generate more variables for my dataset. My data is stored in multiple files, and with pandas I can only read a single file at a time because of memory limitations. Each csv file holds the data for a single month and goes something like this:


Index     Date          Sender     Recipient     Quantity     Type
------------------------------------------------------------------------
79XT     26-03-19       Adam       Tiffany       72           Box
57ZY     14-03-19       Josh       Ross          13           Snack
29UQ     19-03-19       Adam       Alex          60           Fruit
56PY     06-03-19       Lucy       Alex          29           Book
41BR     28-03-19       Josh       Steve         33           Snack

Now I am trying to generate more features for each row based on the history of each sender, and join these features to the dataframe. For example:

Index    Date       Sender   Recipient   Quantity   Type    Days Since          Days Since       Cumulative   Quantity Increase        First Shipment
                                                            Previous Shipment   First Shipment   Quantity     from Previous Shipment   to This Recipient?
----------------------------------------------------------------------------------------------------------------------------------------------------------
79XT     26-03-19   Adam     Tiffany     72         Box     7                   62               1792         12                       0
57ZY     14-03-19   Josh     Ross        13         Snack   NaN                 NaN              13           NaN                      1
29UQ     19-03-19   Adam     Alex        60         Fruit   5                   55               1730         -7                       1
56PY     06-03-19   Lucy     Alex        29         Book    23                  32               88           -4                       0
41BR     28-03-19   Josh     Steve       33         Snack   14                  14               46           20                       1

As you can see from the desired dataframe above, the new variables are generated from each sender's previous observations. What is the least computationally expensive way of generating such features? I will need information from all of my monthly csv files to compute them. There are over 200,000 unique senders, so it would take weeks to read the csv files, produce a dataframe and a csv file for every unique sender, and merge that data back into the monthly csv files. I am aware of Dask and Dask Distributed, but I want to find out if there is a simpler way to implement what I am trying to do.

– Pleastry

2 Answers


I see multiple sub-problems in your question. For the first-occurrence date per sender, a quick merge looks like this:

first = df.groupby("Sender", as_index=False).agg(first_occurrence_date=("Date", "min"))
df = df.merge(first, on="Sender", how="left")
# likely computationally inefficient, and doesn't solve the multiple-file issue immediately
  • Computationally efficient solutions: For fast reading, consider using .feather as an efficient storage format. The standard for this changes, so always keep a .csv as a backup. You can write a file as feather like this: df.to_feather("filename").
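A minimal round-trip sketch, assuming one monthly file (the filename is a placeholder, and .feather needs pyarrow installed):

import pandas as pd

# one-time conversion: parse the csv once, store a fast binary copy
df = pd.read_csv("2019-03.csv")
df.to_feather("2019-03.feather")

# every later run reads the feather copy instead of re-parsing the csv
df = pd.read_feather("2019-03.feather")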

Consider factorizing your strings with pd.factorize(), as described in the pandas docs. I have not seen benchmarks on this, but comparing ints is faster than comparing strings.
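A minimal sketch, assuming df is one monthly frame with the Sender column from your sample:

import pandas as pd

# factorize returns integer codes plus the table of unique values
codes, uniques = pd.factorize(df["Sender"])
df["Sender"] = codes  # e.g. Adam -> 0, Josh -> 1, Lucy -> 2
# uniques[codes] maps the integers back to the original strings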

Lastly, consider setting up a small sqlite3 database into which you load the individual files. Otherwise, getting the first occurrence will be a pain, because you have to keep overwriting the old value and repeat a computationally expensive operation many times.
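A minimal sketch of that setup; monthly_files is a placeholder for your list of csv paths, and MIN() on the dates only works if they are stored in a sortable format (e.g. ISO yyyy-mm-dd):

import sqlite3

import pandas as pd

con = sqlite3.connect("shipments.db")

# one-time load: append every monthly csv into a single table
for path in monthly_files:  # placeholder for your list of csv paths
    pd.read_csv(path).to_sql("shipments", con, if_exists="append", index=False)

# the first occurrence per sender is now a single query
# instead of a repeated scan over every csv
first = pd.read_sql(
    "SELECT Sender, MIN(Date) AS first_occurrence_date "
    "FROM shipments GROUP BY Sender",
    con,
)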

– Dustin

Here I have a different approach. I'd try to:

  1. Convert all csvs to parquet (see this answer if needed), changing dtypes along the way. At the very least, convert the dates:

df['Date'] = df['Date'].astype("M8")

or

df['Date'] = pd.to_datetime(df['Date'])
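A minimal one-time conversion sketch; the folder names are assumptions, and the date format string matches the dd-mm-yy dates in your sample:

import glob
import os

import pandas as pd

os.makedirs("processed", exist_ok=True)
for path in glob.glob("monthly_csv/*.csv"):  # adjust to your layout
    df = pd.read_csv(path)
    df["Date"] = pd.to_datetime(df["Date"], format="%d-%m-%y")
    name = os.path.splitext(os.path.basename(path))[0]
    df.to_parquet(os.path.join("processed", name + ".parquet"))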
  2. Partition again by Sender. I'm assuming all parquet files are in the processed folder.
import dask.dataframe as dd
df = dd.read_parquet('processed')
df.to_parquet('processed2', partition_on='Sender')
  3. Now you have many files in every Sender=username folder; you should merge all of them into a single file (see the sketch after the function below).

  4. You can now create your function and run it for every Sender=username partition:

import pandas as pd

def fun(df):
    # df holds all shipments of a single sender, oldest first after sorting
    df = df.sort_values("Date")
    df["Days Since Prev Shipment"] = df["Date"].diff().dt.days
    df["Days Since First Shipment"] = (df["Date"] - df["Date"].min()).dt.days
    df["Cumulative Quantity"] = df["Quantity"].cumsum()
    df["Quantity Difference"] = df["Quantity"].diff()
    # flag rows that are the sender's first shipment to that recipient
    grp = df.groupby("Recipient")["Date"].min().reset_index(name="First Shipment")
    df = pd.merge(df, grp, how="left", on="Recipient")
    df["First Shipment"] = (df["Date"] == df["First Shipment"]).astype("int8")
    return df
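A minimal sketch for steps 3 and 4, assuming the folder layout produced by partition_on above (the Sender value is encoded in the folder name, not inside the part files, and the features output folder is a placeholder):

import glob
import os

import pandas as pd

os.makedirs("features", exist_ok=True)
for folder in glob.glob("processed2/Sender=*"):  # one folder per sender
    df = pd.read_parquet(folder)  # concatenates all part files in the folder
    out = fun(df)
    sender = os.path.basename(folder).split("=", 1)[1]
    out.to_parquet(os.path.join("features", sender + ".parquet"))

This keeps memory bounded: only one sender's rows are loaded at a time.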
– rpanai
  • For the third step: is there a way to get rid of the partitions so that each `Sender` has exactly one parquet file? According to the dask documentation, each partition is written to a separate file: https://docs.dask.org/en/latest/dataframe-api.html#dask.dataframe.to_parquet – Pleastry Aug 07 '20 at 13:51
  • Oh sorry, I misunderstood what you meant. I think you were referring to the read process just before defining the function in step 4. Thanks – Pleastry Aug 07 '20 at 13:55
  • +1 for teaching about parquet and that partition_on exists. Do you happen to know advantages / disadvantages over using feather? – Dustin Aug 07 '20 at 13:59
  • Hi Dustin, possibly the best answer around is [this](https://stackoverflow.com/a/48097717/4819376). He's the creator of pandas and is heavily involved in both feather and parquet. – rpanai Aug 07 '20 at 14:17
  • @rpanai how does `df.to_parquet('processed2', partition_on='Sender')` work in principle? I keep getting missing data in the newly generated parquet files. Does the number of observations per sender have to be fairly close? Some senders have only a couple of observations while others have thousands. – Pleastry Aug 25 '20 at 12:49