creating if else statement to changing duplicate values to subsequent value (grouped by multiple columns)

Question

I'm in so far over my head. Could get this with R but no idea with Python (which I'm learning).

So...I have a data frame with participants who are sampled at four times each day for 30 days. The issue is that if the participant filled out two surveys within a certain range, the time_of_day for both could be 2; so there are times when time_of_day has the same value for the same participant and same date (which it shouldn't). I checked with the director of the lab and this is because the inputs weren't time-locked. So to resolve this, I think I'd need to create some kind of ifelse statement, if there is a duplicate to make it the preceding value (if there is a duplicate for 2, make the first value for time_of_day a 1). I don't know how to do this within the lambda/if else statement I created below, but where I'm totally in over my head is that these would need to be grouped by participant and date as well, and I have no idea how to combine an if else statement with dropping duplicates within a lambda statement in Python.

data\
.assign(time_of_day = data['time_of_day'].apply(lambda x: if else )

So for the data below, for time_of_day for the participant 21 for 2019-12-08 date, the 1st observation (currently a 2) would become a 1, and the 3rd observation (currently a 4) would become a 3.

I couldn't find the equivalent of dput() in R, so I've converted a minimal reproducible data frame to a dictionary (which I read about here: Print pandas data frame for reproducible example (equivalent to dput in R))...so just convert it to a dataframe with data = pd.DataFrame.from_dict(df_dict):

df_dict = {'pid': {0: '21',
  1: '21',
  2: '21',
  3: '21',
  4: '21',
  5: '21',
  200: '26',
  201: '26',
  202: '26',
  203: '26',
  204: '26'},
 'datestamp': {0: Timestamp('2019-12-07 21:33:11'),
  1: Timestamp('2019-12-08 13:32:16'),
  2: Timestamp('2019-12-08 13:33:41'),
  3: Timestamp('2019-12-08 19:54:22'),
  4: Timestamp('2019-12-08 19:55:24'),
  5: Timestamp('2019-12-09 12:11:24'),
  200: Timestamp('2020-02-12 20:16:33'),
  201: Timestamp('2020-02-13 08:37:21'),
  202: Timestamp('2020-02-13 13:24:20'),
  203: Timestamp('2020-02-13 17:05:27'),
  204: Timestamp('2020-02-13 20:02:12')},
 'date': {0: datetime.date(2019, 12, 7),
  1: datetime.date(2019, 12, 8),
  2: datetime.date(2019, 12, 8),
  3: datetime.date(2019, 12, 8),
  4: datetime.date(2019, 12, 8),
  5: datetime.date(2019, 12, 9),
  200: datetime.date(2020, 2, 12),
  201: datetime.date(2020, 2, 13),
  202: datetime.date(2020, 2, 13),
  203: datetime.date(2020, 2, 13),
  204: datetime.date(2020, 2, 13)},
 'time_of_day': {0: 4,
  1: 2,
  2: 2,
  3: 4,
  4: 4,
  5: 2,
  200: 4,
  201: 1,
  202: 2,
  203: 3,
  204: 4},
 'depressed': {0: 3,
  1: 3,
  2: 4,
  3: 4,
  4: 4,
  5: 4,
  200: 3,
  201: 3,
  202: 1,
  203: 2,
  204: 2},
 'prev_night_sleep': {0: 0.0,
  1: 0.0,
  2: 0.0,
  3: 0.0,
  4: 0.0,
  5: 11.35,
  200: 7.166666666666667,
  201: 10.18333333333333,
  202: 10.18333333333333,
  203: 10.18333333333333,
  204: 10.18333333333333}}

I'm struggling to understand the data and expected output. Can you reformat it and shrink it to the minimum needed to reproduce the problem, then give an example of the desired output. — defladamouse, Dec 02 '21 at 20:17
It's a data frame of 5 observations and six variables, so it's really already been shrunk. I can add expected output. — James, Dec 02 '21 at 20:27
Why for `pid=21` and `date=2019-12-08`, the `time_of_day` vector should not be simply `[1, 2, 3, 4]` (in the order of `datestamp`)? — Corralien, Dec 02 '21 at 20:40
Not sure I fully understand your question, but I think you're assuming that a participant would complete each survey (particularly when it was sent). Often, participants miss visits, don't complete questionnaires; hence missing data. Or, in another scenario, they complete what they should've completed at the second time point at the 3rd timepoints, so it registers (in this case) as 3, when it should've been 2. Ideally, it would've registered all expected timepoints and then would've just filled in as missing (wasn't my design). — James, Dec 02 '21 at 20:47
Currently trying to figure out how to create NAs at the appropriate time slots where data was never completed. — James, Dec 02 '21 at 20:48

score 1 · Answer 1 · answered Dec 02 '21 at 20:55

1

Try:

df['time_of_day'] = df.groupby(['pid', 'date']).cumcount().add(1)
print(df)

    pid           datestamp        date  time_of_day  depressed  prev_night_sleep
0    21 2019-12-07 21:33:11  2019-12-07            1          3          0.000000
1    21 2019-12-08 13:32:16  2019-12-08            1          3          0.000000
2    21 2019-12-08 13:33:41  2019-12-08            2          4          0.000000
3    21 2019-12-08 19:54:22  2019-12-08            3          4          0.000000
4    21 2019-12-08 19:55:24  2019-12-08            4          4          0.000000
5    21 2019-12-09 12:11:24  2019-12-09            1          4         11.350000
200  26 2020-02-12 20:16:33  2020-02-12            1          3          7.166667
201  26 2020-02-13 08:37:21  2020-02-13            1          3         10.183333
202  26 2020-02-13 13:24:20  2020-02-13            2          1         10.183333
203  26 2020-02-13 17:05:27  2020-02-13            3          2         10.183333
204  26 2020-02-13 20:02:12  2020-02-13            4          2         10.183333

answered Dec 02 '21 at 20:55

Corralien

109,409
8
28
52

What?...this is so simple. Thanks! I thought it was much more complicated. What is going on with cumcount().add(1)? How is this return the previous value of a duplicate? – James Dec 02 '21 at 21:11
Please take the time to check if this is really what you need. I changed many thing in the column `time_of_day`. I don't care about duplicate in fact, I group rows by (pid, date) and I set to each record (max 4) a number from 0 to 3 (`cumcount`) and add 1. – Corralien Dec 02 '21 at 21:16
Yeah, that's not what I want. It's changing values without duplicates to something else. – James Dec 02 '21 at 21:29
So IIUC, for [2, 2, 4, 4], the result should be [2-1, 2, 4-1, 4] so [1, 2, 3, 4]. But if, for a time_of_day, you have [1, 2, 3, 3], the result should be [1, 2, 3, 3+1] so [1, 2, 3, 4]? What is the output of `dput` for a such case? – Corralien Dec 03 '21 at 07:25
Thanks for your help with this @Corallien. Yes, you have that right, just to correct in your example [4-3] not 1, but in the subsequent list you provide, it is correct [1,2,3,4]. I don't know that there would ever be an instance in which you would have [1,2,3,3]; for now, I think we can exclude such a case because it would certainly be an anomaly, and I would have to look further into it. – James Dec 03 '21 at 17:57

score 0 · Answer 2 · answered Dec 03 '21 at 22:01

I didn't find a very easy way of doing. But the following code should work for this example.

import pandas as pd
import datetime
from pandas import Timestamp
df_dict = {'pid': {0: '21',
  1: '21',
  2: '21',
  3: '21',
  4: '21',
  5: '21',
  200: '26',
  201: '26',
  202: '26',
  203: '26',
  204: '26'},
 'datestamp': {0: Timestamp('2019-12-07 21:33:11'),
  1: Timestamp('2019-12-08 13:32:16'),
  2: Timestamp('2019-12-08 13:33:41'),
  3: Timestamp('2019-12-08 19:54:22'),
  4: Timestamp('2019-12-08 19:55:24'),
  5: Timestamp('2019-12-09 12:11:24'),
  200: Timestamp('2020-02-12 20:16:33'),
  201: Timestamp('2020-02-13 08:37:21'),
  202: Timestamp('2020-02-13 13:24:20'),
  203: Timestamp('2020-02-13 17:05:27'),
  204: Timestamp('2020-02-13 20:02:12')},
 'date': {0: datetime.date(2019, 12, 7),
  1: datetime.date(2019, 12, 8),
  2: datetime.date(2019, 12, 8),
  3: datetime.date(2019, 12, 8),
  4: datetime.date(2019, 12, 8),
  5: datetime.date(2019, 12, 9),
  200: datetime.date(2020, 2, 12),
  201: datetime.date(2020, 2, 13),
  202: datetime.date(2020, 2, 13),
  203: datetime.date(2020, 2, 13),
  204: datetime.date(2020, 2, 13)},
 'time_of_day': {0: 4,
  1: 2,
  2: 2,
  3: 4,
  4: 4,
  5: 2,
  200: 4,
  201: 1,
  202: 2,
  203: 3,
  204: 4},
 'depressed': {0: 3,
  1: 3,
  2: 4,
  3: 4,
  4: 4,
  5: 4,
  200: 3,
  201: 3,
  202: 1,
  203: 2,
  204: 2},
 'prev_night_sleep': {0: 0.0,
  1: 0.0,
  2: 0.0,
  3: 0.0,
  4: 0.0,
  5: 11.35,
  200: 7.166666666666667,
  201: 10.18333333333333,
  202: 10.18333333333333,
  203: 10.18333333333333,
  204: 10.18333333333333}}


data = pd.DataFrame.from_dict(df_dict)
to_change = data[data.duplicated(subset=['pid', 'date', 'time_of_day'], keep=False)].sort_values(['date', 'time_of_day'])

new_time_of_day = []
prev_t = 0
prev_d = ''
count = 1
for d, t in zip(to_change['date'], to_change['time_of_day']):
    if d == prev_d and t==prev_t:
        new_time_of_day.append(f'{t}-{count}')
        count+=1
    else:
        new_time_of_day.append(f'{t}')
        count = 1
        
    prev_d = d
    prev_t = t
    
del to_change['time_of_day']
to_change.insert(3, 'time_of_day', new_time_of_day)
data = data.drop_duplicates(subset=['pid', 'date', 'time_of_day'], keep=False)
data = pd.concat([to_change, data]).sort_index()
data.head(20)

creating if else statement to changing duplicate values to subsequent value (grouped by multiple columns)

2 Answers2