2

I have a large table of records, about 4 million rows. I need to add an index that counts orders by email address based on the orderID (ascending).

import pandas as pd
df = pd.read_csv('orders.csv', sep=";")

df.dtypes
orderId                  int64
transactionDate          object
revenue                 float64
email                    object
category                 object

rank = df2.groupby("email").orderId.rank(method='first')

When I try to set a variable called rank, the program ran for 90 minutes and took about 5.5 gigs of RAM, but never returned the data. I am just trying to add a column so that for each email (my customerID), I get the order rank based on the orderId. So if I had 3 orders, my first order would have the lowest orderID, etc...the rank restarts for every email.

Thanks for your help.

Jeff

Jeffrey James
  • 225
  • 4
  • 10
  • sorting is slow, it's O(n*log(n)). I think this may also be doing an apply, which is also slow. What's the reason you have to do this? – Andy Hayden Oct 02 '15 at 00:45
  • For each orderid I need to get the order number for the email. Many emails have multiple orders. Open to any solution that will do that. – Jeffrey James Oct 02 '15 at 00:56
  • try `df = df.sort(['email','orderId']); df.groupby('email').cumcount()` - may be more efficient? – chrisb Oct 02 '15 at 01:01
  • @JeffreyJames is the ranking important? Can you get away with just enumerating? I think it depends what's the next step... – Andy Hayden Oct 02 '15 at 01:10
  • @AndyHayden - the next step will be taking the categories from the first time customers, then calculating total lifetime value. So I need to get all the emails in the set of rank=1, then for all those emails, given their starting category, calculated average all time lifetime value. – Jeffrey James Oct 02 '15 at 02:01
  • @chrisb - the sort operation was super fast (10 second) but the df.groupby('email').cumcount() wasn't clear on what it was supposed to return – Jeffrey James Oct 02 '15 at 02:08
  • [cumcount doc](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.core.groupby.GroupBy.cumcount.html). When you say "rank" do you mean size, i.e. user only goes once, or the first entry (which would be, once sorted, `.groupby("email").nth(0)` – Andy Hayden Oct 02 '15 at 02:35
  • @AndyHayden - for each order an email has, i need a column that has that order as being the customer's Nth order. The criteria for the first order could be that the orderid for the customer is the lowest possible orderid related to that customer. By customer I mean 'email' – Jeffrey James Oct 02 '15 at 02:57

1 Answers1

1

Typically in large memory situations you would chunk your data and run each chunk serially. There's lots of good suggestions for doing this:

"Large data" work flows using pandas

Community
  • 1
  • 1
mh00h
  • 1,824
  • 3
  • 25
  • 45