5

I have a DataFrame like this:

          Name           first_seen       last_seen
   0      Random guy 1   5/22/2016 18:12  5/22/2016 18:15 
   1      Random guy 2   5/22/2016 12:03  5/22/2016 12:03 
   2      Random guy 3   5/22/2016 21:06  5/22/2016 21:06
   3      Random guy 4   5/22/2016 16:20  5/22/2016 16:20 
   4      Random guy 5   5/22/2016 14:46  5/22/2016 14:46 

Now I have to add a column named Visit_period which takes one of 4 values [morning,afternoon,evening,night] when maximum time spent by that person (row) fell into:

 - morning: 08:00 to 12:00 hrs
 - afternoon: 12:00 to 16:00 hrs
 - evening: 16:00 to 20:00 hrs
 - night: 20:00 to 24:00 hrs

so for above five row out put will be something like this.

   visit_period
        evening
      afternoon
          night
        evening
      afternoon  

I have mentioned maximum time spent because, it may happen that some person's first_seen is at 14:30 and last_seen is 16:21. I would like to assign the value afternoon as he spent 30 mins in afternoon slab and 21 in evening slab. I am using python 2.7.

Stefan
  • 41,759
  • 13
  • 76
  • 81
  • And what have you tried? – Julien Jun 16 '16 at 05:18
  • @JulienBernu this is one small part of my whole code which I am having problem with. if you can give some logical map that will also help. –  Jun 16 '16 at 05:26
  • The logical road map is to try something before asking people to do the work for you. If you have already tried something and can't understand what's going wrong with it, then share your code and the specific problem that doesn't make sense. [mcve] [ask] – Julien Jun 16 '16 at 05:41
  • @JulienBernu thanks for your suggestion I will keep this in mind in future. –  Jun 16 '16 at 05:48

2 Answers2

1

You could use apply with the below main_visit_period function that attempts to assign a visit period according to the conditions you outlined:

times = list(range(8, 21, 4))
labels = ['morning', 'afternoon', 'evening', 'night']
periods = dict(zip(times, labels))

which gives:

{8: 'morning', 16: 'evening', 12: 'afternoon', 20: 'night'}

now the function to assign periods:

def period(row):
    visit_start = {'hour': row.first_seen.hour, 'min': row.first_seen.minute} # get hour, min of visit start
    visit_end = {'hour': row.last_seen.hour, 'min': row.last_seen.minute} # get hour, min of visit end
    for period_start, label in periods.items():
        period_end = period_start + 4
        if period_start <= visit_start['hour'] < period_end:
            if period_start <= visit_end['hour'] < period_end or (period_end - visit_start['hour']) * 60 - visit_start['min'] > (visit_end['hour'] - period_end) * 60 + visit_end['min']:
                return label
            else:
                return periods[period_end] # assign label of following period  

and finally .apply():

df['period'] = df.apply(period, axis=1)

to get:

           Name          first_seen           last_seen     period
0  Random guy 1 2016-05-22 18:12:00 2016-05-22 18:15:00    evening
1  Random guy 2 2016-05-22 12:03:00 2016-05-22 12:03:00  afternoon
2  Random guy 3 2016-05-22 21:06:00 2016-05-22 21:06:00      night
3  Random guy 4 2016-05-22 16:20:00 2016-05-22 16:20:00    evening
4  Random guy 5 2016-05-22 14:46:00 2016-05-22 14:46:00  afternoon
Stefan
  • 41,759
  • 13
  • 76
  • 81
  • Thanks for this one. gonna try. just a question, since I am already running a loop in each row of data frame for other part of code so if I can add this there rather running one more loop, I guess that will be optimal. –  Jun 16 '16 at 05:58
  • May need to add more granularity if you want to compare the timeperiods on a minute basis, just using the `hour` above. If you only wanted to decide based on the latest time, ie `last_seen`, you could use `last_seen.dt.hour` and `map()` the corresponding period. If you have a lot of data, you may want to use a vectorized solution to check for the `timedeltas` as well. – Stefan Jun 16 '16 at 06:04
  • Thanks for suggestion, yes I do want to do it on minute basis and not only on the latest. I have a large dataset , and i didnt understood the last line. will try to figure out. –  Jun 16 '16 at 06:11
  • this is giving error: times = list(range(8, 21, 4)) TypeError: 'list' object is not callable –  Jun 16 '16 at 06:27
  • added the differentiation by minute. do you have a variable called `list` somewhere in your code? http://stackoverflow.com/questions/31087111/typeerror-list-object-is-not-callable-in-python – Stefan Jun 16 '16 at 06:30
  • NO , I dont have how ever there is no need to use list() i guess. coz times = range(8,21,4) itself a list –  Jun 16 '16 at 06:34
  • do I need to put my last_seen and first_seen column in any specific format. Right now this is giving error:AttributeError: ("'str' object has no attribute 'hour'", u'occurred at index 0') –  Jun 16 '16 at 06:38
  • I'm using python3, by the way. Need to convert to `datetime` using `df.last_seen = pd.to_datetime(df.last_seen)`. – Stefan Jun 16 '16 at 06:45
  • is this problem due to python version. `if period_start <= visit_end['hour'] < period_end or (period_end - visit_start['hour']) * 60 - visit_start['min'] > (visit_end - period_end['hour']) * 60 + period_end['min']: TypeError: ("'int' object has no attribute '__getitem__'", u'occurred at index 14')` –  Jun 16 '16 at 06:57
  • http://stackoverflow.com/questions/11480042/python-3-turn-range-to-a-list – Stefan Jun 16 '16 at 06:58
  • What? the datetime conversion? – Stefan Jun 16 '16 at 07:22
  • No that TypeError: ("'int' object has no attribute '__getitem__'", u'occurred at index 14') this error. –  Jun 16 '16 at 07:35
  • on the mentioned line in my 3rd last comment –  Jun 16 '16 at 07:36
  • I mixed up visit and period variables, only visit_start/end are dicts, period_xx are ints. Think it's good now, see update, sorry for the confusion. – Stefan Jun 16 '16 at 07:41
  • still there is a error `return periods[visit_end['hour']] # assign label of following period KeyError: (19, u'occurred at index 14')` probably because `visit_end['hour']` isn't the key. –  Jun 16 '16 at 08:05
  • Let us [continue this discussion in chat](http://chat.stackoverflow.com/rooms/114818/discussion-between-ranadan-and-stefan). –  Jun 16 '16 at 08:47
  • if you have some time to please give attention to discussion's question. –  Jun 16 '16 at 12:40
0

You can do this:

start  = pd.datetime(2016, 05, 22, 8, 00, 00)
d = ["Morning", "Afternoon", "Evening", "Night"]

def max_spent(fs, ls):

    # Transform your date into timedelta in seconds:
    sr = np.arange(8,25,4)*3600
    fss = (fs-start).seconds
    lss = (ls-start).seconds

    # In which slot would it fit ?
    fs_d = sr.searchsorted(fss)
    ls_d = sr.searchsorted(lss)
    # If it's not the same for both date:
    if fs_d != ls_d:
        # get the one with the biggest amount of time:
        if fss - sr[fs_d - 1] > lss - sr[ls_d - 1]:
            return d[fs_d-1]
        else:
            return d[ls_d-1]
    else:
        return d[ls_d-1]

Then, you just do:

df["visit_period"] = df.apply(lambda x: max_spent(x["first_seen"], x["last_seen"]), axis=1)

and you get:

df 
   Name          first_seen           last_seen visit_period
0  guy1 2016-05-22 18:12:00 2016-05-22 18:15:00      Evening
1  guy2 2016-05-22 12:03:00 2016-05-22 12:03:00    Afternoon
2  guy3 2016-05-22 21:06:00 2016-05-22 21:06:00        Night
3  guy4 2016-05-22 16:20:00 2016-05-22 16:20:00      Evening
4  guy5 2016-05-22 14:46:00 2016-05-22 14:46:00    Afternoon
5  guy6 2016-05-22 14:30:00 2016-05-22 16:21:00    Afternoon

Previous version with pd.cut, better I think if one does not need to assess which columns is best:

# Transform your date into timedelta in seconds:
df["sec"] = map(lambda x: x.seconds, df.last_seen-start)

# Apply Cut on this column:
df["visit_period"] = pd.cut(df.sec, np.arange(8,25,4)*3600, labels=d)

I've done it on last_seen only, but you can make another column with the value corresponding do the maximum time spent and then you can do this on that column.

HTH

jrjc
  • 21,103
  • 9
  • 64
  • 78