
Goal

I am trying to prepare data for time series classification using sktime and minirocket. The classification will work on time windows of 20 seconds. The data needs to be given to minirocket as a multi-index pandas DataFrame, like this one for example:

id   name    20_s_window
1    A       0.469112 0.684662 1.462547
     B      -0.282863 ...
2    A      -1.509059 ... 
     B      -1.135632 ...
3    A       1.212112 ...
     B      -0.173215 ...
4    A       0.119209 ...
     B      -1.044236 ...

Each "id" row represents 20 seconds of data.

But my current dataframe looks like this:

     A          B
1    0.469112   -0.282863
2    0.684662   ...

All of the data for A is in one column, and so on, with the row index being the time in milliseconds.

What I've tried

The closest I got to the expected result is with these lines of code:

df["timestamp"] = pd.to_datetime(df.index, unit="ms")  # index is time in ms
df_transpose = df.groupby(pd.Grouper(key="timestamp", freq="20s")).apply(
    lambda x: [x[track].tolist() for track in df.columns if track != "timestamp"]
)

Which gives this result:

timestamp
1970-01-01 00:05:00    [[69.19940185546875, 68.21199798583984, 68.211...
1970-01-01 00:05:20    [[82.0363998413086, 81.04889678955078, 79.0739...
1970-01-01 00:05:40    [[67.22450256347656, 67.22450256347656, 67.224...
1970-01-01 00:06:00    [[62.287200927734375, 63.27470016479492, 62.28...

The issue is that every column ends up in the same list on the same row, instead of having its own separate index.

I also tried creating the multi-index dataframe first with this:

col = ["A", "B"]
indexes = list(range(10))
iterables = [indexes, col]
df = pd.MultiIndex.from_product(iterables, names=["col", "index"])

Which gives the following object:

MultiIndex([(0, 'A'),
            (0, 'B'),
            (1, 'A'),
            (1, 'B'),
            (2, 'A'),
            (2, 'B'),
            (3, 'A'),
            (3, 'B'),
            (4, 'A'),
            (4, 'B')],
           names=['col', 'index'])

But I have no idea how to get the data from the original dataframe into it.

Reproducible example

Data

import pandas as pd

tracks = ["A", "B", "C"]
values = [[1, 5, 9], [2, 6, 10], [3, 7, 11], [4, 8, 12]]
df = pd.DataFrame(values, columns=tracks)

# The default integer index is treated as seconds, so a 2 s window = 2 rows
df["timestamp"] = pd.to_datetime(df.index, unit="s")
grouper = pd.Grouper(key="timestamp", freq="2s")
df_transpose = df.groupby(grouper).apply(
    lambda x: [x[track].tolist() for track in df.columns if track != "timestamp"]
)

Let's say I want to group this dataframe into windows of 2 rows; my goal is to make this example look like this:

id name values
1  A    1  2
1  B    5  6
1  C    9  10
2  A    3  4
2  B    7  8
2  C    11 12

But it looks like this:

id
1970-01-01 00:00:00    [[1, 2], [5, 6], [9, 10]]
1970-01-01 00:00:02    [[3, 4], [7, 8], [11, 12]]

Is there a way to do it? Thank you in advance.

Edit after answer

I misunderstood sktime's input structure, so in the end I only used half of @mozway's multiple-columns answer:

df["timestamp"] = pd.to_datetime(df.index, unit="ms")
grouper = pd.Grouper(key="timestamp", freq="20s")
out = (
    df.assign(
        window=(g := df.groupby(grouper)).ngroup().add(1), row=g.cumcount().add(1)
    )
    .drop(columns="timestamp")
    .set_index(["window", "row"])
)

In the end it looks like this:

            A  B   C
window row          
1      1    1  5   9
       2    2  6  10
2      1    3  7  11
       2    4  8  12
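
For completeness: a frame shaped like this (one index level for the window/instance, one for the position inside the window, one column per variable) matches what sktime calls the "pd-multiindex" format. As a minimal sketch, assuming a recent sktime version (class names and import paths can differ between versions) and windows long enough for MiniRocket's kernels, it could then be fed to the transformer roughly like this:

from sktime.transformations.panel.rocket import MiniRocketMultivariate

# `out` is the (window, row)-indexed frame shown above, built from the real
# 20-second windows; MiniRocket needs series longer than its kernel length,
# so the 2-row toy windows only illustrate the shape.
mr = MiniRocketMultivariate()
features = mr.fit_transform(out)  # one row of features per window
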
  • Can you try: `df.rename_axis(index='id', columns='name').stack().to_frame(name='20_s_window')`. If this doesn't work the way you want, please provide a reproducible input (as code, and without `...`) – mozway Jun 28 '23 at 11:29
  • Thank you for your comment, your solution comes close to what I'd want, but it only puts one value per row, not a window, unless I misunderstood where I was supposed to use it... I added code at the end to have a dataframe similar to what I work with. – Côme Vincent Jun 28 '23 at 12:23
  • So it's not what you want, and the question is better now. One more ambiguity: do you have 2 "values" columns in the output? – mozway Jun 28 '23 at 12:26

2 Answers


Regarding your issue ("The issue is that it only uses 1 of my columns and puts it in a list."): this is because of the way you are using apply:

  1. If the function you pass to an apply call (on a groupby object) returns a value which is not a native pandas object (i.e. not a Series/DataFrame), then each group's return value just ends up as a single list in one cell
  2. Proposal to resolve this: try returning a Series instead of a list, i.e. turn
lambda x: x["A"].tolist()

into something like:

lambda x: pd.Series(x["A"].tolist(), index=...)
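
For illustration (a minimal sketch on a made-up frame with two-row windows; the t0/t1 labels are just placeholder names for the positions inside a window), the two return types behave like this:

import pandas as pd

df = pd.DataFrame({"A": [1, 2, 3, 4], "B": [5, 6, 7, 8]})
g = df.groupby(lambda i: i // 2)  # two-row windows

# Returning a plain list: each window collapses into a single cell
g.apply(lambda x: x["A"].tolist())
# 0    [1, 2]
# 1    [3, 4]
# dtype: object

# Returning a Series: each value gets its own labelled column instead
g.apply(lambda x: pd.Series(x["A"].tolist(), index=["t0", "t1"]))
#    t0  t1
# 0   1   2
# 1   3   4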

In general:

u = df.groupby(lambda x: x // 20)
u.apply(lambda x: pd.DataFrame([x["A"].values, x["B"].values], index=["A", "B"]))

This should produce a DataFrame with a MultiIndex.
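
For instance, on the reproducible example from the question (with the window shrunk from 20 rows to 2 so the toy frame actually yields two groups), this sketch gives roughly:

import pandas as pd

tracks = ["A", "B", "C"]
values = [[1, 5, 9], [2, 6, 10], [3, 7, 11], [4, 8, 12]]
df = pd.DataFrame(values, columns=tracks)

u = df.groupby(lambda i: i // 2)  # 2-row windows instead of 20
out = u.apply(lambda x: pd.DataFrame(
    [x["A"].values, x["B"].values, x["C"].values], index=["A", "B", "C"]
))
print(out)
#       0   1
# 0 A   1   2
#   B   5   6
#   C   9  10
# 1 A   3   4
#   B   7   8
#   C  11  12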

P.Jo
  • Thank you for your answer, I changed this part as it was not very clear. My issue is more about giving each column from the original dataframe its own index rather than getting rid of the list. I'll soon add a reproducible example to my question. – Côme Vincent Jun 28 '23 at 12:02
  • Ok, I edited my post. Maybe you can try my suggestion. – P.Jo Jun 28 '23 at 12:03
  • Your general solution yields a groupby object instead of a dataframe, and when converting it using `pd.DataFrame` it prints a [messed up table](https://cl1p.net/messed_dataframe) that I'm not sure I understand... – Côme Vincent Jun 28 '23 at 12:37
  • Are you sure? The u variable is indeed a pandas.core.groupby.generic.DataFrameGroupBy object, but the result of the second line should yield a DataFrame. And btw: I assume df is your original DataFrame, not the one where you added timestamps etc. – P.Jo Jun 28 '23 at 12:46
  • I'm sorry, you were right. I just assumed the apply function was in-place, so I tried printing u directly. `u.apply` is indeed a dataframe, but it's not quite the right shape... The answer suggested by mozway works though, so I'll be using it. Thank you for your time! – Côme Vincent Jun 28 '23 at 12:55

It looks like you want to reshape your DataFrame:

grouper = pd.Grouper(key="timestamp", freq="2s")

out = (df
  .assign(id=df.groupby(grouper).ngroup().add(1))
  .drop(columns='timestamp').rename_axis(columns='name')
  .set_index(['id']).stack()
  .groupby(level=['id', 'name']).agg(list)
  .reset_index(name='values')
)

Output:

   id name    values
0   1    A    [1, 2]
1   1    B    [5, 6]
2   1    C   [9, 10]
3   2    A    [3, 4]
4   2    B    [7, 8]
5   2    C  [11, 12]

If you want multiple columns:

grouper = pd.Grouper(key="timestamp", freq="2s")

out = (df
  .assign(id=(g:=df.groupby(grouper)).ngroup().add(1),
          col=g.cumcount().add(1)
         )
  .drop(columns='timestamp').rename_axis(columns='name')
  .set_index(['id', 'col']).stack()
  .unstack('col').add_prefix('value').reset_index()
)

Output:

col  id name  value1  value2
0     1    A       1       2
1     1    B       5       6
2     1    C       9      10
3     2    A       3       4
4     2    B       7       8
5     2    C      11      12
mozway
  • Thank you, your solution worked perfectly ! I'll post an update when I know if it worked with sktime and minirocket too. – Côme Vincent Jun 28 '23 at 12:49
  • @Côme so which one did you want? lists or multiple columns? – mozway Jun 28 '23 at 12:50
  • Well, after more reading and experimentation, it appears that I misunderstood sktime's input structure, but I was close, so I used only half of your answer with columns: I kept the assign, drop and set_index steps, without the unstack. – Côme Vincent Jun 28 '23 at 13:41
  • I edited my question to include the solution I used – Côme Vincent Jun 28 '23 at 13:53