
I have a dataframe like this (except mine is very large):

user1      user2   day   hour  quantity
-------------------------------------
Alice      Bob      1     12     250
Alice      Bob      1     13     250
Bob        Carol    1     10     20
Alice      Bob      4     1      600
.
.
.

...then suppose I get the following groupby and aggregation (by user1, user2 and day):

user1      user2   day   quantity
---------------------
Alice      Bob      1      500
                    4      600
Bob        Carol    1      20
                    3      100

where the day should go from 0-364 (365 days). What I want is the interquartile range (and median) of the daily quantities for each user1-user2 pair across all 365 days -- the catch is that days with zero quantity don't appear in the dataframe, so the zeroes aren't counted.

Life would have been easier if I had explicit zeroes for all excluded days:

user1    user2    day   quantity
---------------------
Alice    Bob      1      500
                  2      0
                  3      0
                  4      600
.....
Bob      Carol    1      20
                  2      0
                  3      100
...

... because then I could do df.reset_index().groupby(['user1', 'user2']).agg({'quantity': scipy.stats.iqr}), but I'm working with a very large dataframe (the example above is a dummy one), and reindexing with zeroes is just not possible.
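For reference, a minimal sketch of what that explicit zero-padding would look like (agg_df is my name for the aggregated frame from the second table, with plain columns user1, user2, day, quantity -- it isn't defined in the question). Building 365 rows per pair is exactly what becomes infeasible at scale:

import numpy as np
import pandas as pd
import scipy.stats

# build every (user1, user2, day) combination for the pairs that exist
pairs = agg_df[['user1', 'user2']].drop_duplicates().assign(key=1)
days = pd.DataFrame({'day': np.arange(365), 'key': 1})
full = pairs.merge(days, on='key').drop('key', axis=1)

# attach the observed quantities and fill the missing days with explicit zeros
padded = full.merge(agg_df, on=['user1', 'user2', 'day'], how='left') \
             .fillna({'quantity': 0})

# with the zeros present, the per-pair stats are a plain groupby-agg
stats = padded.groupby(['user1', 'user2'])['quantity'].agg([np.median, scipy.stats.iqr])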

I have an idea of how to do it: since I know there are 365 days, I can just pad the rest of the values with zeroes:

Alice-Bob: [500,600] + (365-2) * [0]

and get the scipy.stats.iqr (and median) of that. However, this would involve iterating over all user1-user2 pairs. From experience, that takes a lot of time.
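To make the padding idea concrete, here is a rough sketch of that slow, loop-over-pairs version (again using the assumed agg_df from above); the Python-level loop over every user1-user2 pair is the part that takes too long:

import numpy as np
import scipy.stats

results = {}
for (u1, u2), sub in agg_df.groupby(['user1', 'user2']):
    # pad the observed daily totals with zeros for the missing days
    values = list(sub['quantity']) + [0] * (365 - len(sub))
    results[(u1, u2)] = (np.median(values), scipy.stats.iqr(values))

# e.g. Alice-Bob: [500, 600] + 363 zeros  ->  median 0.0, IQR 0.0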

Is there any vectorized solution to this? I also need to get the median, and I think the same approach should hold.

  • do you need for each pair `user1` and `user2` days from `1` to `365`? – jezrael May 25 '18 at 12:06
  • What I want is a measure of variation over the number of days. So there is only one IQR per user1-user2 pair, and the values to comprise the distribution are the values of `quantity` for each day (0-364). I also edited my question to indicate that this will help me get the median, too. – irene May 25 '18 at 12:07
  • Could you please share an example of your input data before groupby? – zipa May 25 '18 at 12:25
  • Hi @zipa, added the original dataframe above. – irene May 25 '18 at 12:32
  • The second table is a groupby then aggregate (you summed `quantity`). – smci May 26 '18 at 02:38
  • Honestly your title is grossly misleading: what you're asking for is zero-padding for all the excluded days of year. – smci May 26 '18 at 02:44
  • Hi @smci, yes, it turns out that was what I needed. – irene May 26 '18 at 08:22
  • @irene: I fixed your title but could you please do that in future? Otherwise SO will have lots of questions where the question as asked, the answer and the asker's intent are different things... and that's before we get to future-proofing against all the pandas version changes. – smci May 26 '18 at 10:21
  • Hi @smci, I've already asked the zero-padding question in an earlier post https://stackoverflow.com/questions/50513688/how-to-show-zero-counts-in-pandas-groupby-for-large-dataframes, and no one answered. At the time I posed this question, I was open to several ways of how to solve this. The first answer by zipa also used zero-padding in the dataframe, but he later changed it to adding zero-padded *tuples* due to memory issues. This was the answer I marked as accepted. – irene May 26 '18 at 10:59

1 Answer


To take advantage of the zeros without putting them in the dataframe you can use something like this:

import numpy as np

test = df.groupby(['user1', 'user2', 'day'])['quantity'].sum().reset_index()\
         .groupby(['user1', 'user2'])\
         .agg({'day': lambda x: tuple(x), 'quantity': lambda x: tuple(x)})\
         .reset_index()

def med_from_tuple(row):
    # start with all 365 days at zero, then overwrite the days that actually
    # appear in the dataframe with their observed quantities
    z = np.zeros(365)
    np.put(z, row['day'], row['quantity'])
    return np.median(z)

test['example'] = test.apply(med_from_tuple, axis=1)

This computes the median of quantity as if the zero days were actually present in the dataframe.

test
#   user1  user2     day    quantity   example
#0  Alice    Bob  (1, 4)  (500, 600)       0.0
#1    Bob  Carol    (1,)       (20,)       0.0
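
The question also asks for the IQR; the same zero-padding trick covers it with a near-identical helper (a sketch, not part of the original answer, assuming scipy is available):

import numpy as np
import scipy.stats

def iqr_from_tuple(row):
    # same zero-padding as med_from_tuple, but summarized with the interquartile range
    z = np.zeros(365)
    np.put(z, row['day'], row['quantity'])
    return scipy.stats.iqr(z)

test['iqr'] = test.apply(iqr_from_tuple, axis=1)

With only a couple of nonzero days per pair, both the median and the IQR come out as 0.0 here, which is exactly what counting the zero days should give.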
  • Hi @zipa! Could you please clarify what `baseData` does? I get an error doing it. Thanks. – irene May 25 '18 at 13:09
  • Were you looking for a dataframe with columns `user1, user2, day` with all possible combinations? – irene May 25 '18 at 13:15
  • @irene Yeah, that is what `baseDf` should provide to you – zipa May 25 '18 at 13:16
  • Hi @zipa, thanks for this. I actually thought of something similar earlier (make a null dataframe with all possible values, then merge, then fill the na's with zero), but for some reason it is taking so much memory (>130GB!) in the cluster so I had to find another way... – irene May 25 '18 at 13:17
  • @irene Understood, very large product. I think I have something in mind, will post once I test it. – zipa May 25 '18 at 13:19
  • Let us [continue this discussion in chat](https://chat.stackoverflow.com/rooms/171782/discussion-between-irene-and-zipa). – irene May 25 '18 at 13:27
  • This worked on my large dataframe! I had to process it in chunks, but it worked within memory constraints. Thanks a lot!!! – irene May 25 '18 at 19:17