2

Say one has a lookup table summarizing the busy lives of a few people on this planet...

import pandas as pd
import numpy as np
import datetime as dt
from datetime import datetime as dt
t=pd.Timestamp

lu = pd.DataFrame({ 'name' : ['Bill','Elon','Larry','Jeff','Marissa'],
                                        'feels' : ['charitable','Alcoa envy','Elon envy','like the number 7','sassy'],
                                        'last ate' : [t('20151209'),t('20151201'),t('20151208'),t('20151208'),t('20151209')],
                                        'boxers' : [True,True,True,False,True]})

Say one also knows where these people live and when they did certain things...

af = pd.DataFrame({ 'name' : ['Bill','Elon','Larry','Elon','Jeff','Larry','Larry'],
                                        'address' : ['in my computer','moon','internet','mars','cardboard box','autonomous car','every where'],
                                        'sq_ft' : [2,2135,69,84535, 1.32, 54,168],
                                        'forks' : [7,1,2,1,0,np.nan,1]})

rand_dates=[t('20141202'),t('20130804'),t('20120508'),t('20150411'),
                        t('20141209'),t('20091023'),t('20130921'),t('20110102'),
                        t('20130728'),t('20141119'),t('20151024'),t('20130824')]

df = pd.DataFrame({ 'name' : ['Elon','Bill','Larry','Elon','Jeff','Larry','Larry','Bill','Larry','Elon','Marissa','Jeff'],
                                        'activity' : ['slept','tripped','spoke','swam','spooked','liked','whistled','up dog','smiled','donated','grant men paternity leave','fondled'],
                                        'date' : rand_dates})

One could rank these people according to addresses they live at as follows:

af.name.value_counts()

Larry    3
Elon     2
Jeff     1
Bill     1

Need 1: Using the ranking above, how would one create a new "ranked" dataframe composed of information from lookup table lu? Simply put, how does one make Exhibit A?

# Exhibit A
  boxers              feels   last ate     name  addresses
0   True          Elon envy 2015-12-08    Larry          3
1   True         Alcoa envy 2015-12-01     Elon          2
2  False  like the number 7 2015-12-08     Jeff          1
3   True         charitable 2015-12-09     Bill          1

Need 2: Observe the output of the groupby operation that follows. How can one determine the time delta between the oldest and newest dates to rank members of lu according to such time deltas?.. Simply put, how does one get from the groupby to Exhibit D?

df.groupby(['name','date']).size()

name     date      
Bill     2011-01-02    1
         2013-08-04    1
Elon     2014-11-19    1
         2014-12-02    1
         2015-04-11    1
Jeff     2013-08-24    1
         2014-12-09    1
Larry    2009-10-23    1
         2012-05-08    1
         2013-07-28    1
         2013-09-21    1
Marissa  2015-10-24    1

#Exhibit B - Calculate time deltas
name     time_delta
Bill     Timedelta('945 days 00:00:00')
Elon     Timedelta('143 days 00:00:00')
Jeff     Timedelta('472 days 00:00:00')
Larry    Timedelta('1429 days 00:00:00')
Marissa  Timedelta('0 days 00:00:00')

#Exhibit C - Rank time deltas (this is easy)
name     time_delta
Larry    Timedelta('1429 days 00:00:00')
Bill     Timedelta('945 days 00:00:00')
Jeff     Timedelta('472 days 00:00:00')
Elon     Timedelta('143 days 00:00:00')
Marissa  Timedelta('0 days 00:00:00')

#Exhibit D - Add to and re-rank the table built in Exhibit A according to time_delta
  boxers              feels   last ate     name  addresses          time_delta
0   True          Elon envy 2015-12-08    Larry          3  1429 days 00:00:00
1   True         charitable 2015-12-09     Bill          1   945 days 00:00:00
2  False  like the number 7 2015-12-08     Jeff          1   472 days 00:00:00
3   True         Alcoa envy 2015-12-01     Elon          2   143 days 00:00:00
4   True              sassy 2015-12-09  Marissa        NaN     0 days 00:00:00

Prior Research: This so post on getting max values using groupby and transform and this other so post on finding and selecting most frequent data are informative but don't work on series (the result of count_values()) or just trip me up... I've actually gotten the first part to work but the code is bugly and likely inefficient.

Easy Peasy Code Sharing Check out this IPython Notebook that lays everything out. Otherwise, check out the Python 2.7 code here.

Community
  • 1
  • 1
zelusp
  • 3,500
  • 3
  • 31
  • 65

1 Answers1

1

I think you can use join, sort_values. Aggregation in docs.

#join value count to lu dataframe, renaming ans sorting
Exhibit_A = lu.set_index('name').join(af.name.value_counts()).rename(columns={'name': 'addresses'}).sort_values('addresses', ascending=False)
#drop rows with NaN, reset index
print Exhibit_A.dropna().reset_index()

    name boxers              feels   last ate  addresses
0  Larry   True          Elon envy 2015-12-08          3
1   Elon   True         Alcoa envy 2015-12-01          2
2   Bill   True         charitable 2015-12-09          1
3   Jeff  False  like the number 7 2015-12-08          1
#aggregate to min and max date 
g = df.groupby(['name']).agg({'date' : [np.max, np.min]})

#reset columns multiindex
levels = g.columns.levels
labels = g.columns.labels
g.columns = levels[1][labels[1]]

g['time_delta'] = g['amax'] - g['amin']

#drop columns
g = g.drop(['amax', 'amin'], axis=1)

#join to Exhibit_A, sort, reset index
Exhibit_D = Exhibit_A.join(g).sort_values('time_delta', ascending=False).reset_index()
#reorder columns
Exhibit_D = Exhibit_D[['boxers', 'feels', 'last ate', 'name', 'addresses' , 'time_delta' ]]
print Exhibit_D

  boxers              feels   last ate     name  addresses  time_delta
0   True          Elon envy 2015-12-08    Larry          3   1429 days
1   True         charitable 2015-12-09     Bill          1    945 days
2  False  like the number 7 2015-12-08     Jeff          1    472 days
3   True         Alcoa envy 2015-12-01     Elon          2    143 days
4   True              sassy 2015-12-09  Marissa        NaN      0 days
zelusp
  • 3,500
  • 3
  • 31
  • 65
jezrael
  • 822,522
  • 95
  • 1,334
  • 1,252