I have a large table of records, about 4 million rows. I need to add a column that numbers each email address's orders in ascending orderId order.
import pandas as pd
df = pd.read_csv('orders.csv', sep=";")
df.dtypes
orderId int64
transactionDate object
revenue float64
email object
category object
rank = df.groupby("email").orderId.rank(method='first')
When I try to assign the result to a variable called rank, the program runs for 90 minutes and uses about 5.5 GB of RAM, but never returns the data. I am just trying to add a column so that for each email (my customer ID), I get the order rank based on the orderId. So if a customer had 3 orders, the first order would be the one with the lowest orderId, and so on; the rank restarts for every email.
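To make the desired output concrete, here is a minimal sketch on made-up data. The sample rows and the column name orderRank are assumptions for illustration only. The sort-then-cumcount approach is a swapped-in alternative to rank(method='first') that should give the same numbering; it has not been timed on the full 4-million-row file.

```python
import pandas as pd

# Made-up sample rows (assumption); the real data comes from orders.csv.
df = pd.DataFrame({
    "orderId": [105, 101, 103, 102, 104],
    "email": ["a@x.com", "a@x.com", "b@x.com", "a@x.com", "b@x.com"],
})

# Sort by orderId with a stable sort, then number the rows within each
# email starting at 1. The assignment aligns on the original index, so
# the row order of df itself is left unchanged.
df["orderRank"] = (
    df.sort_values("orderId", kind="mergesort")
      .groupby("email")
      .cumcount()
    + 1
)

print(df)
```

For a@x.com the orders 101, 102, 105 get ranks 1, 2, 3, and the numbering restarts at 1 for b@x.com.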
Thanks for your help.
Jeff