
I have the following simplified dataframe:

Date           Name       Score       V       H       M
2018-01-01       A          5         V1      H4      M6
2018-01-01       B          3         V5      H2      M1
2018-01-01       C          4         V7      H6      M6
2018-01-01       A          4         V11     H9      M3
2018-01-01       C          2         V4      H2      M18
2018-01-02       A          4         V9      H1      M9
2018-01-02       B          1         V15     H4      M10
2018-01-02       A          3         V10     H10     M14
2018-01-03       C          5         V5      H21     M34
2018-01-04       A          3         V8      H9      M6
2018-01-04       A          4         V4      H15     M9
2018-01-04       C          2         V2      H4      M14
2018-01-04       B          5         V1      H1      M2

Treating the above dataframe as the raw set, I've indexed it by date and resampled to a monthly level. What I'd eventually like to do is create an individual time series for each unique value in the columns Name, V, H and M, with the score as the series values (accomplished by grouping). Even after downsampling, I expect irregular time series lengths across these columns' unique values, and I plan to interpolate to handle that.
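
For reference, the indexing and monthly resampling step mentioned above looks roughly like this (a simplified sketch of my current approach; indexed and monthly_by_name are just illustrative names, and the Name grouping would be repeated for V, H and M):

import pandas as pd

df['Date'] = pd.to_datetime(df['Date'])
indexed = df.set_index('Date')

# One monthly-mean Score series per unique Name; the same idea applies
# to the V, H and M columns.
monthly_by_name = indexed.groupby('Name')['Score'].resample('M').mean()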

The goal is to create and extract multiple time series into a new dataframe and to explore their correlation maps. For example, I would have individual time series for V1, V2, ..., Vn, H1, H2, ..., Hn, M1, M2, ..., Mn, and so on.

I'm not sure if this should all be captured in one new dataframe or in multiple dataframes based on the groupings. Here's an example of what the time series output should look like:

        Date           Score         
V1      2018-01-01      4.5
        2018-02-01      4.1
        2018-03-01      4.3
        2018-04-01      4.2
        2018-05-01      4.4

        Date           Score         
V2      2018-01-01      4.5
        2018-02-01      4.1
        2018-03-01      4.3
        2018-04-01      4.2
        2018-05-01      4.4

        Date           Score         
V3      2018-01-01      4.5
        2018-02-01      4.1
        2018-03-01      4.3
        2018-04-01      4.2
        2018-05-01      4.4

I need help implementing an efficient way to do this, and I'd like to know whether I'm on the right track. The dataframe above is a simplified version of a larger dataset.

Appreciate any help and guidance.

jarwal
    What is the expected output? – Franco Piccolo Nov 04 '18 at 18:12
  • Hi, edited the post to clarify expected output. I'm looking to extract timeseries from the raw into a new dataframe (or possibly multiple dataframes based on the groupings taken from the columns). Apologies for the confusion. – jarwal Nov 04 '18 at 18:27
  • So the code should go from daily to monthly? And how do you handle cases where there is more than one value in the same time period for the same key? Do you average them? – Franco Piccolo Nov 04 '18 at 18:49
  • That's right. It's not strictly daily: some days may be missing here and there, and on some days there may be multiple scores for any of the unique values in any of the columns. I handle that by resampling to monthly, and by taking the mean I can accommodate the duplicates. – jarwal Nov 04 '18 at 18:57

2 Answers


You could use concat to stack the columns Name, V, H and M below each other. After that you can fill in the matching scores with np.tile. Finally you can apply resample, mean and interpolate to the groups like:

import pandas as pd
import numpy as np

df['Date'] = pd.to_datetime(df['Date'])
df.set_index('Date', inplace=True)

# Stack Name, V, H and M into one column, then repeat Score to match.
df2 = pd.concat([df.Name, df.V, df.H, df.M]).to_frame(name='Name')
df2['Score'] = np.tile(df['Score'].values, 4)

# One monthly, interpolated mean series per unique value.
df2.groupby('Name').apply(lambda x: x.resample('M').mean().interpolate())
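
To then pull a single time series out of that result, something like the following should work (just a sketch; result, v1_series and a_series are illustrative names, and the groupby/apply above is assumed to return a frame indexed by (Name, Date)):

# Sketch: slice the (Name, Date)-indexed result per group.
result = df2.groupby('Name').apply(lambda x: x.resample('M').mean().interpolate())
v1_series = result.xs('V1', level=0)['Score']   # monthly Score series for V1
a_series = result.xs('A', level=0)['Score']     # same idea for Name == 'A'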
Franco Piccolo
  • Thank you very much for this approach. np.tile() comes in very handy after the concat(). Is there a way I can parse out the timeseries based on their groupings? For example can I iteratively fill a new dataframe with unique groupings from column Name, V, H, M? – jarwal Nov 05 '18 at 02:09
  • I don't understand your question. Maybe you can edit your question with some expected output for clarification. – Franco Piccolo Nov 05 '18 at 04:48

I tried this with pandas melt. I was trying to use a double melt, but this SO post explained that it cannot be done, so I had to (1) melt (stack) the columns excluding the Score column to give df_modmelt, and then (2) append a new column that is simply a repeated version of the Score column, based on this idea: the same column is repeated (vertically concatenated) len(df_modmelt)/len(df) times. This is justified because melting does not change the row order, so each block in the melted dataframe df_modmelt is in the same order as the original data, and appending the Score column preserves the association between the Score column and the others. Then just use Grouper with key='Date' and freq='M'.

Here's the sample code (I appended 3 rows to your sample data to get some values in a second month):

import pandas as pd

df['Date'] = pd.to_datetime(df['Date'])

         Date Name  Score    V    H    M
0  2018-01-01    A      5   V1   H4   M6
1  2018-01-01    B      3   V5   H2   M1
2  2018-01-01    C      4   V7   H6   M6
3  2018-01-01    A      4  V11   H9   M3
4  2018-01-01    C      2   V4   H2  M18
5  2018-01-02    A      4   V9   H1   M9
6  2018-01-02    B      1  V15   H4  M10
7  2018-01-02    A      3  V10  H10  M14
8  2018-01-03    C      5   V5  H21  M34
9  2018-01-04    A      3   V8   H9   M6
10 2018-01-04    A      4   V4  H15   M9
11 2018-01-04    C      2   V2   H4  M14
12 2018-01-04    B      5   V1   H1   M2
13 2018-02-01    A      4   V5  H11   M9
14 2018-02-01    C      2   V7   H5  M14
15 2018-02-02    B      5   V1   H7   M2

Code

cols = ['V', 'H', 'M', 'Name']

# Melt (stack) everything except Score into long format.
df_modmelt = (df[['Date'] + cols]
                .melt(id_vars=['Date'],
                      var_name='column',
                      value_name='value'))

# Repeat the Score column once per melted block (integer division is
# needed so the list-repetition count is an int).
df_modmelt['Score'] = (pd.concat([df['Score']] * (len(df_modmelt) // len(df)))
                         .reset_index(drop=True))

# Monthly mean score per unique value in each original column.
df_final = (df_modmelt
            .groupby(['column', 'value', pd.Grouper(key='Date', freq='M')])['Score']
            .mean()
            .reset_index(drop=False))

# Shift the month-end labels produced by the Grouper back to month start.
df_final['Date'] = df_final['Date'].dt.floor('d') - pd.offsets.MonthBegin(1)
print(df_final)

Here is the output

   column value       Date     Score
0       H    H1 2018-01-01  4.500000
1       H   H10 2018-01-01  3.000000
2       H   H11 2018-02-01  4.000000
3       H   H15 2018-01-01  4.000000
4       H    H2 2018-01-01  2.500000
5       H   H21 2018-01-01  5.000000
6       H    H4 2018-01-01  2.666667
7       H    H5 2018-02-01  2.000000
8       H    H6 2018-01-01  4.000000
9       H    H7 2018-02-01  5.000000
10      H    H9 2018-01-01  3.500000
11      M    M1 2018-01-01  3.000000
12      M   M10 2018-01-01  1.000000
13      M   M14 2018-01-01  2.500000
14      M   M14 2018-02-01  2.000000
15      M   M18 2018-01-01  2.000000
16      M    M2 2018-01-01  5.000000
17      M    M2 2018-02-01  5.000000
18      M    M3 2018-01-01  4.000000
19      M   M34 2018-01-01  5.000000
20      M    M6 2018-01-01  4.000000
21      M    M9 2018-01-01  4.000000
22      M    M9 2018-02-01  4.000000
23   Name     A 2018-01-01  3.833333
24   Name     A 2018-02-01  4.000000
25   Name     B 2018-01-01  3.000000
26   Name     B 2018-02-01  5.000000
27   Name     C 2018-01-01  3.250000
28   Name     C 2018-02-01  2.000000
29      V    V1 2018-01-01  5.000000
30      V    V1 2018-02-01  5.000000
31      V   V10 2018-01-01  3.000000
32      V   V11 2018-01-01  4.000000
33      V   V15 2018-01-01  1.000000
34      V    V2 2018-01-01  2.000000
35      V    V4 2018-01-01  3.000000
36      V    V5 2018-01-01  4.000000
37      V    V5 2018-02-01  4.000000
38      V    V7 2018-01-01  4.000000
39      V    V7 2018-02-01  2.000000
40      V    V8 2018-01-01  3.000000
41      V    V9 2018-01-01  4.000000
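
From df_final, an individual time series can then be pulled out directly, e.g. for V1 (a small sketch; v1 is an illustrative name):

v1 = (df_final[(df_final['column'] == 'V') & (df_final['value'] == 'V1')]
      .set_index('Date')['Score'])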

Edit

The last line uses pd.offsets.MonthBegin(1) simply to return the first date of the month.
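
To illustrate with a small standalone example (not from the original answer):

import pandas as pd

ts = pd.Timestamp('2018-01-31')          # a month-end label from the Grouper
print(ts - pd.offsets.MonthBegin(1))     # 2018-01-01 00:00:00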

EDIT 2

Per the request for additional information about correlation between unique values in the value column of df_final, see the useful links here:

for c in ['Name', 'H']:
    # Keep only rows whose value belongs to column c, then pivot to wide form.
    df_pivot = (df_final[df_final.value.isin(df[c].unique().tolist())]
                .pivot_table(index='Date', columns='value', values='Score'))
    print(df_pivot)
    print(df_pivot.corr())

Output for column Name

print(df_pivot)
value              A    B     C
Date                           
2018-01-01  3.833333  3.0  3.25
2018-02-01  4.000000  5.0  2.00

print(df_pivot.corr())
value    A    B    C
value               
A      1.0  1.0 -1.0
B      1.0  1.0 -1.0
C     -1.0 -1.0  1.0

Output for column H

print(df_pivot)
value        H1  H10  H11  H15   H2  H21        H4   H5   H6   H7   H9
Date                                                                  
2018-01-01  4.5  3.0  NaN  4.0  2.5  5.0  2.666667  NaN  4.0  NaN  3.5
2018-02-01  NaN  NaN  4.0  NaN  NaN  NaN       NaN  2.0  NaN  5.0  NaN

print(df_pivot.corr())
value  H1  H10  H11  H15  H2  H21  H4  H5  H6  H7  H9
value                                                
H1    NaN  NaN  NaN  NaN NaN  NaN NaN NaN NaN NaN NaN
H10   NaN  NaN  NaN  NaN NaN  NaN NaN NaN NaN NaN NaN
H11   NaN  NaN  NaN  NaN NaN  NaN NaN NaN NaN NaN NaN
H15   NaN  NaN  NaN  NaN NaN  NaN NaN NaN NaN NaN NaN
H2    NaN  NaN  NaN  NaN NaN  NaN NaN NaN NaN NaN NaN
H21   NaN  NaN  NaN  NaN NaN  NaN NaN NaN NaN NaN NaN
H4    NaN  NaN  NaN  NaN NaN  NaN NaN NaN NaN NaN NaN
H5    NaN  NaN  NaN  NaN NaN  NaN NaN NaN NaN NaN NaN
H6    NaN  NaN  NaN  NaN NaN  NaN NaN NaN NaN NaN NaN
H7    NaN  NaN  NaN  NaN NaN  NaN NaN NaN NaN NaN NaN
H9    NaN  NaN  NaN  NaN NaN  NaN NaN NaN NaN NaN NaN
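
To extend this to all four columns in one go (a sketch only; corr_maps is an illustrative name), you could filter on the column label in df_final instead of using isin(), which also avoids collisions between identical values from different columns:

corr_maps = {}
for c in ['Name', 'V', 'H', 'M']:
    # Wide table: one column per unique value, one row per month.
    pivot = (df_final[df_final['column'] == c]
             .pivot_table(index='Date', columns='value', values='Score'))
    corr_maps[c] = pivot.corr()
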
edesz
  • Thanks very much for this comment, this is a really interesting output. So, if I wanted to group by the unique column values (e.g. groupby H1, H2, H3, etc), would I use something like df.get_group(unique_value)? How can I do this in an iterative fashion? My goal is to feed these timeseries into a correlation map (1 map per Name, V, H, and M) – jarwal Nov 05 '18 at 02:17
  • I have posted one way to do this for columns `H` and `Name` using pandas `pivot_table`. In the sample data, there will be `NaN`s where data is missing. Presumably in your real data there are fewer `NaN`s. FYI: it's probably best to ask for modifications in a new question, since that keeps the OP clean and structured. – edesz Nov 05 '18 at 03:12
  • Thank you again - really appreciate your methodology and the resource links you provided. Super informative. I'll go through them and any questions I have for modifications, I'll be sure to ask in a new question. Appreciate your time. – jarwal Nov 05 '18 at 03:38