0

I'm trying to create a dataframe where columns relate to the ID of sold items and the row indices are IDs of the customers who bought those items. The cells should show how much every customer bought of every item. To get this information I read CSV file containing a row for every transaction made by customers.

The file is parsed into the frame_ variable. I retrieve the customer and article IDs using the unique() function on the corresponding columns and use them to create a new dataframe with those IDs as column headers and row indices.

with open(f"{file_path}") as file:
    frame_ = pd.read_csv(file, sep="\t", header=None)
    
    customer_ids = list(frame_[customer_index].unique())
    item_ids = list(frame_[item_index].unique())
    
    frame = pd.DataFrame.from_dict(
        dict.fromkeys(item_ids, dict.fromkeys(customer_ids, 0)))

For the next step I want to iterate over frame_ to check every row for 3 values:

  1. customer ID
  2. item ID
  3. amount of sold items

The amount should be added to the current value at frame.at[customer_id, item_id].

for index, row in frame_.iterrows():
    customer = row[customer_index]
    item = row[item_index]
    amount = abs(float(row[2]))

    frame.at[customer, item] += amount

This part is especially slow due to me using iterrows().
I looked through some questions but because I don't quite know what I'm looking for exactly I couldn't find any solution on how to perform my task more efficiently.

Thank you for your time and any suggestions you can offer.

Edit: The original file and the frame_ dataframe contain around ~2.5mil rows

Edit 2: added excerpt from frame_, "..." contain other information not relevant for this part. Column headers are actually 0-8, "ID", "amount", "itemID" and "customerID" were added for readability:

ID ... amount ... ... itemID ... customerID ...
1  ... -5.0    ... ... 1258   ... 805214     ...
2  ... -10.0   ... ... 3658   ... 798125     ...
3  ... -7.5    ... ... 2056   ... 589012     ...

Edit 3: Expected output would look something like this:

       1258 3658 2056
805214 5.0  0    0
798125 0    10.0 0
589012 0    0    7.5
marc_s
  • 732,580
  • 175
  • 1,330
  • 1,459
Floxyz
  • 103
  • 1
  • 4
  • Have you considered running it in parallel? [dask](https://dask.org/) makes it quite easy. Here is a [great example](https://stackoverflow.com/questions/53561794/iteration-over-a-pandas-df-in-parallel/53923034#53923034) – James Burgess Jun 22 '20 at 07:06
  • Do you really have negative amounts? What is the point of calling `abs`? – DYZ Jun 22 '20 at 07:27
  • they are negative because the sale "removed" them from stock. I call ```abs``` to turn them positive which seems more intuitive to me – Floxyz Jun 22 '20 at 07:34

1 Answers1

1

Start by preparing another column of absolute values of amounts (though I do not fully understand what you need abs and float - aren't your amounts already positive and numeric?):

import numpy as np
frame_["amount1"] = np.abs(frame_["amount"].astype(float))

Then aggregate by customer and item indexes:

frame = frame_.groupby(["customerID", "itemID"])["amount1"].sum()

No explicit iterations needed. You can convert the result to a "wide" format if you want:

frame.unstack().fillna(0)
#itemID      1258  2056  3658
#customerID                  
#589012       0.0   7.5   0.0
#798125       0.0   0.0  10.0
#805214       5.0   0.0   0.0
DYZ
  • 55,249
  • 10
  • 64
  • 93
  • I get the following error: AttributeError: 'DataFrame' object has no attribute 'groupy' And yeah, I probably don't need the ```float``` cast. I just like to be sure sometimes. – Floxyz Jun 22 '20 at 07:36
  • Surely it does not. The typo is all yours. – DYZ Jun 22 '20 at 07:37
  • Right, sorry, my bad – Floxyz Jun 22 '20 at 07:39
  • Thank you very much - It seems to have worked perfectly! – Floxyz Jun 22 '20 at 07:40
  • You may want to double-check the need for `abs`. It feels totally out of place in your problem. – DYZ Jun 22 '20 at 07:41
  • Since my amounts are actually negative (forgot to show this in my examples and will fix it right away) I use ```abs``` to turn them positive. Of course I could leave them negative but I think it's more intuitive to have them be positive – Floxyz Jun 22 '20 at 07:47
  • If all of them are negative, simply negate the results with `-`. – DYZ Jun 22 '20 at 07:49