I'm trying to create a dataframe where columns relate to the ID of sold items and the row indices are IDs of the customers who bought those items. The cells should show how much every customer bought of every item. To get this information I read CSV file containing a row for every transaction made by customers.
The file is parsed into the frame_
variable. I retrieve the customer and article IDs using the unique()
function on the corresponding columns and use them to create a new dataframe with those IDs as column headers and row indices.
with open(f"{file_path}") as file:
frame_ = pd.read_csv(file, sep="\t", header=None)
customer_ids = list(frame_[customer_index].unique())
item_ids = list(frame_[item_index].unique())
frame = pd.DataFrame.from_dict(
dict.fromkeys(item_ids, dict.fromkeys(customer_ids, 0)))
For the next step I want to iterate over frame_
to check every row for 3 values:
- customer ID
- item ID
- amount of sold items
The amount should be added to the current value at frame.at[customer_id, item_id]
.
for index, row in frame_.iterrows():
customer = row[customer_index]
item = row[item_index]
amount = abs(float(row[2]))
frame.at[customer, item] += amount
This part is especially slow due to me using iterrows()
.
I looked through some questions but because I don't quite know what I'm looking for exactly I couldn't find any solution on how to perform my task more efficiently.
Thank you for your time and any suggestions you can offer.
Edit: The original file and the frame_
dataframe contain around ~2.5mil rows
Edit 2: added excerpt from frame_
, "..." contain other information not relevant for this part. Column headers are actually 0-8, "ID", "amount", "itemID" and "customerID" were added for readability:
ID ... amount ... ... itemID ... customerID ...
1 ... -5.0 ... ... 1258 ... 805214 ...
2 ... -10.0 ... ... 3658 ... 798125 ...
3 ... -7.5 ... ... 2056 ... 589012 ...
Edit 3: Expected output would look something like this:
1258 3658 2056
805214 5.0 0 0
798125 0 10.0 0
589012 0 0 7.5