
First, I want to forward fill my data at a 1S frequency for EACH UNIQUE VALUE in Group_Id, so basically group by Group_Id and then resample with ffill.

Here is the data:

    Id           Timestamp         Data    Group_Id    
0    1     2018-01-01 00:00:05.523 125.5   101 
1    2     2018-01-01 00:00:05.757 125.0   101 
2    3     2018-01-02 00:00:09.507 127.0   52  
3    4     2018-01-02 00:00:13.743 126.5   52  
4    5     2018-01-03 00:00:15.407 125.5   50
                    ...

11   11    2018-01-01 00:00:07.523 125.5   120 
12   12    2018-01-01 00:00:08.757 125.0   120 
13   13    2018-01-04 00:00:14.507 127.0   300  
14   14    2018-01-04 00:00:15.743 126.5   300  
15   15    2018-01-05 00:00:19.407 125.5   350
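
For reference, the rows shown above can be rebuilt with something like the snippet below (rows 5-10 are elided in the display, so they are left out here as well):

import pandas as pd

dfdf = pd.DataFrame({
    'Id': [1, 2, 3, 4, 5, 11, 12, 13, 14, 15],
    'Timestamp': ['2018-01-01 00:00:05.523', '2018-01-01 00:00:05.757',
                  '2018-01-02 00:00:09.507', '2018-01-02 00:00:13.743',
                  '2018-01-03 00:00:15.407', '2018-01-01 00:00:07.523',
                  '2018-01-01 00:00:08.757', '2018-01-04 00:00:14.507',
                  '2018-01-04 00:00:15.743', '2018-01-05 00:00:19.407'],
    'Data': [125.5, 125.0, 127.0, 126.5, 125.5,
             125.5, 125.0, 127.0, 126.5, 125.5],
    'Group_Id': [101, 101, 52, 52, 50, 120, 120, 300, 300, 350],
})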

I previously did this:

def daily_average_temperature(dfdf):
    INDEX = dfdf[['Group_Id', 'Timestamp', 'Data']]
    INDEX['Timestamp'] = pd.to_datetime(INDEX['Timestamp'])
    INDEX = INDEX.set_index('Timestamp')
    INDEX1 = INDEX.resample('1S').last().fillna(method='ffill')

    return INDEX1

This is wrong because it doesn't group the data by Group_Id first; it resamples the whole frame and simply ignores the column.

Second, I would like to spread the Data values so that each row is one Group_Id, with positional index labels as columns replacing Timestamp. It should look something like this:

    x0      x1      x2      x3      x4      x5      ...   Group_Id
0   40      31.05   25.5    25.5    25.5    25      ...   1
1   35      35.75   36.5    36.5    36.5    36.5    ...   2
2   25.5    25.5    25.5    25.5    25.5    25.5    ...   3
3   25.5    25.5    25.5    25.5    25.5    25.5    ...   4
4   25      25      25      25      25      25      ...   5
⋮    ⋮       ⋮        ⋮       ⋮       ⋮        ⋮             ⋮

Please note that this table above is not related to the previous dataset but just used to show the format.

Thanks

nilsinelabore

1 Answer


Use DataFrame.groupby with DataFrameGroupBy.resample:

def daily_average_temperature(dfdf):
    dfdf['Timestamp'] = pd.to_datetime(dfdf['Timestamp'])
    # resample each group to 1-second intervals, then forward fill within the group
    dfdf = (dfdf.set_index('Timestamp')
                .groupby('Group_Id')['Data']
                .resample('1S')
                .last()
                .ffill()
                .reset_index())

    return dfdf

print (daily_average_temperature(dfdf))
    Group_Id           Timestamp   Data
0         50 2018-01-03 00:00:15  125.5
1         52 2018-01-02 00:00:09  127.0
2         52 2018-01-02 00:00:10  127.0
3         52 2018-01-02 00:00:11  127.0
4         52 2018-01-02 00:00:12  127.0
5         52 2018-01-02 00:00:13  126.5
6        101 2018-01-01 00:00:05  125.0
7        120 2018-01-01 00:00:07  125.5
8        120 2018-01-01 00:00:08  125.0
9        300 2018-01-04 00:00:14  127.0
10       300 2018-01-04 00:00:15  126.5
11       350 2018-01-05 00:00:19  125.5

EDIT: This solution uses the minimal and maximal datetimes in the data for DataFrame.reindex with a date_range (a DatetimeIndex in the columns) after reshaping by Series.unstack; back filling is also added where necessary:

def daily_average_temperature(dfdf):
    dfdf['Timestamp'] = pd.to_datetime(dfdf['Timestamp'])
    # drop milliseconds to get the minimal and maximal whole seconds in the data
    s = dfdf['Timestamp'].dt.floor('S')

    dfdf = (dfdf.set_index('Timestamp')
                .groupby('Group_Id')['Data']
                .resample('1S')
                .last()
                .unstack()
                .reindex(pd.date_range(s.min(),s.max(), freq='S'), axis=1, method='ffill')
                .rename_axis('Timestamp', axis=1)
                .bfill(axis=1)
                .ffill(axis=1)
                .stack()
                .reset_index(name='Data')
                )

    return dfdf

df = daily_average_temperature(dfdf)

print (df['Group_Id'].value_counts())

350    345615
300    345615
120    345615
101    345615
52     345615
50     345615
Name: Group_Id, dtype: int64
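
The count of 345615 per group checks out: the floored timestamps span 2018-01-01 00:00:05 through 2018-01-05 00:00:19, i.e. 4*86400 + 14 + 1 = 345615 one-second steps with both endpoints included.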

Another solution is similar, only the date_range is specified by string values (not derived dynamically from the min and max):

def daily_average_temperature(dfdf):
    dfdf['Timestamp'] = pd.to_datetime(dfdf['Timestamp'])

    dfdf = (dfdf.set_index('Timestamp')
                .groupby('Group_Id')['Data']
                .resample('1S')
                .last()
                .unstack()
                .reindex(pd.date_range('2018-01-01','2018-01-08', freq='S'), 
                         axis=1, method='ffill')
                .rename_axis('Timestamp', axis=1)
                .bfill(axis=1)
                .ffill(axis=1)
                .stack()
                .reset_index(name='Data')
                )

    return dfdf

df = daily_average_temperature(dfdf)

print (df['Group_Id'].value_counts())
350    604801
300    604801
120    604801
101    604801
52     604801
50     604801
Name: Group_Id, dtype: int64
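
Note the count here is 604801 rather than 7*24*60*60 = 604800 because date_range includes both endpoints (2018-01-01 00:00:00 through 2018-01-08 00:00:00).

For the second part of the question (one row per Group_Id with positional columns x0, x1, ...), a minimal sketch building on the long output df from above could look like this. The x labels are just an illustrative naming convention, and here Group_Id ends up in the first column rather than the last:

# pivot the long output: one row per Group_Id, one column per second
wide = df.set_index(['Group_Id', 'Timestamp'])['Data'].unstack()

# replace the Timestamp column labels with positional names x0, x1, x2, ...
wide.columns = [f'x{i}' for i in range(len(wide.columns))]
wide = wide.reset_index()

print(wide.head())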
jezrael
  • is it possible to also extend the starting and ending time for all `Group_Id` to `2018-01-01 00:00:00.000` and `2018-01-08 00:00:00.000` respectively, to make the number of observations for each `Group_Id` the same? Thanks – nilsinelabore Oct 12 '19 at 00:35
  • Thanks for your answer, but I think I still don't get what I want. I want to have the same number of counts for each `Group_Id`, given they have the same time period and time interval. I used `dfdf1= dfdf.groupby(['Group_Id']).count() dfdf1.count` and it seems the counts are different for all `Group_Id`. Do you know what went wrong? Thanks! – nilsinelabore Oct 14 '19 at 10:42
  • @nilsinelabore - You are right, it is necessary to also add `ffill(axis=1)`; the answer was edited. – jezrael Oct 14 '19 at 10:52
  • The data was supposed to be a time series with fixed time interval, ie. `1S`, I want to set the same starting and ending time (`2018-01-01 00:00:00.000` and `2018-01-08 00:00:00.000` = 7 days) so that all `Group_Id` will have the same number of counts/rows of `7*24*60*60=604800`, one for each second. – nilsinelabore Oct 14 '19 at 10:52
  • Awesome, just wondering why the 2 solutions give different counts? – nilsinelabore Oct 14 '19 at 10:59
  • @nilsinelabore - because the first uses the min and max and the second uses `'2018-01-01','2018-01-08'` – jezrael Oct 14 '19 at 11:00
  • I think you missed out `.reindex(pd.date_range('2018-01-01 00:00:00.000', '2018-01-08 00:00:00.000', freq='S'), axis=1, method='ffill')` for first solution, but it worked for me:) Second solution might have something wrong as it's taking only `date` from the `datetime` data. Thanks a lot – nilsinelabore Oct 14 '19 at 11:12
  • @nilsinelabore - yes, I omitted it, but no problem if you use it. – jezrael Oct 14 '19 at 11:14
  • It looked a bit different on my whole data if I drop it but I'm happy adding that in:) – nilsinelabore Oct 14 '19 at 11:15
  • Could you please help me with this question? https://stackoverflow.com/questions/58384682/how-to-transpose-time-series-by-unique-values-in-a-column I tried as suggested by others but it really does not help me to learn or solve this question. Thanks – nilsinelabore Oct 14 '19 at 22:52
  • @nilsinelabore - I checked the comments and am not sure, can you clarify what the problem is? – jezrael Oct 15 '19 at 07:40
  • basically I wanted to have 1 row of data for each `Group_Id `, so all `Group_Id ` are in the first column, then I want the timestamps lined up from second column all the way to the last column, just looks like the table in the question. `x0 x1 x2 x3 x4 ...` are the new column names replacing the original timestamps (as I have same time interval from this question) – nilsinelabore Oct 15 '19 at 08:03
  • could you please consider this question?https://stackoverflow.com/questions/58514020/how-to-extract-data-from-df2-based-on-datetime-data-in-df1-in-python Thank you so much! – nilsinelabore Oct 23 '19 at 12:49
  • @nilsinelabore - Answer was added. – jezrael Oct 23 '19 at 13:58