I have the following data frame:

         unique_id gsett               start                 end  catpri  rep
0        000001_01     2 1900-01-01 04:00:00 1900-01-01 07:50:00       1   23
1        000001_01     2 1900-01-01 07:50:00 1900-01-01 08:00:00       2    1
2        000001_01     2 1900-01-01 08:00:00 1900-01-01 08:10:00       3    1
3        000001_01     2 1900-01-01 08:10:00 1900-01-01 08:30:00       4    2
4        000001_01     2 1900-01-01 08:30:00 1900-01-01 09:50:00       5    8
5        000001_01     2 1900-01-01 09:50:00 1900-01-01 10:00:00       2    1
6        000001_01     2 1900-01-01 10:00:00 1900-01-01 11:50:00       6   11
7        000001_01     2 1900-01-01 11:50:00 1900-01-01 12:00:00       4    1
8        000001_01     2 1900-01-01 12:00:00 1900-01-01 13:50:00       4   11
9        000001_01     2 1900-01-01 13:50:00 1900-01-01 14:50:00       7    6
10       000001_01     2 1900-01-01 14:50:00 1900-01-01 15:50:00       8    6
11       000001_01     2 1900-01-01 15:50:00 1900-01-01 16:00:00       3    1
12       000001_01     2 1900-01-01 16:00:00 1900-01-01 16:50:00       3    5
13       000001_01     2 1900-01-01 16:50:00 1900-01-01 17:50:00       3    6
14       000001_01     2 1900-01-01 17:50:00 1900-01-01 18:00:00       9    1
15       000001_01     2 1900-01-01 18:00:00 1900-01-01 18:40:00       5    4
16       000001_01     2 1900-01-01 18:40:00 1900-01-01 18:50:00       9    1
17       000001_01     2 1900-01-01 18:50:00 1900-01-01 20:50:00       4   12
18       000001_01     2 1900-01-01 20:50:00 1900-01-01 21:20:00       7    3
19       000001_01     2 1900-01-01 21:20:00 1900-01-01 21:30:00       8    1
20       000001_01     2 1900-01-01 21:30:00 1900-01-01 21:50:00       8    2
21       000001_01     2 1900-01-01 21:50:00 1900-01-01 22:00:00      10    1
22       000001_01     2 1900-01-01 22:00:00 1900-01-02 00:50:00      10   17
23       000001_01     2 1900-01-01 00:50:00 1900-01-02 04:00:00      13   19

There are more than 1,000,000 rows. I need to build a data frame with one row per unique_id, whose columns are the timestamps from 1900-01-01 04:00:00 to 1900-01-02 04:00:00 at 10-minute intervals. The value in each cell should be the catpri value for that unique_id at that time.

It should look something like this:

          1900-01-01 04:00:00 1900-01-01 04:10:00  ... 1900-01-02 03:50:00 1900-01-02 04:00:00
000001_01                   1                   1  ...                  13                  13

I have tried writing a function and using apply, but it still takes ages.

Any suggestions on how to do this quickly?

Lorenzo Bottaccioli
  • Your expected output doesn't match what you've described. Can you clarify? – Quang Hoang Apr 17 '19 at 18:54
  • Do you simply want a transpose of your dataframe? – AvyWam Apr 17 '19 at 18:56
  • To echo @AvyWam, this looks like a transpose or pivot/pivot_table, which can be significantly sped up using the built-in functions in pandas – G. Anderson Apr 17 '19 at 18:58
  • No, I don't want to transpose. It is a kind of transpose: my original data frame has, in each row, the catpri value recorded for the unique_id in that time interval. What I want at the end is one row for each unique_id, with 144 columns running from 1900-01-01 04:00 to 1900-01-02 04:00:00. – Lorenzo Bottaccioli Apr 17 '19 at 18:58
  • Possible duplicate of [How to pivot a dataframe](https://stackoverflow.com/questions/47152691/how-to-pivot-a-dataframe) – G. Anderson Apr 17 '19 at 19:00
  • @G.Anderson I don't think those examples fit. In the col column you find all the possible col values; I need to put the right catpri value between two dates. If you have a suggestion on how to use pivot/pivot_table, I would appreciate it. – Lorenzo Bottaccioli Apr 17 '19 at 19:05
  • How should the `catpri` values be interpolated between missing timestamp intervals? Should it be the same as the previous valid data? – G. Anderson Apr 17 '19 at 20:54

1 Answer


As in my comment, I believe this is a use case for `df.pivot`, as outlined below:

df.pivot(index='unique_id', columns='start', values='catpri')

start   1900-01-01 00:50:00 1900-01-01 04:00:00 1900-01-01 07:50:00 1900-01-01 08:00:00 1900-01-01 08:10:00 1900-01-01 08:30:00 1900-01-01 09:50:00 1900-01-01 10:00:00 1900-01-01 11:50:00 1900-01-01 12:00:00 ... 1900-01-01 16:50:00 1900-01-01 17:50:00 1900-01-01 18:00:00 1900-01-01 18:40:00 1900-01-01 18:50:00 1900-01-01 20:50:00 1900-01-01 21:20:00 1900-01-01 21:30:00 1900-01-01 21:50:00 1900-01-01 22:00:00
unique_id                                                                                   
000001_01   13      1       2       3       4       5       2       6       4       4   ...     3       9       5       9       4       7       8       8       10      10
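
As a minimal runnable sketch of that pivot step (toy data constructed here for illustration, assuming the column names from the question):

```python
import pandas as pd

# Toy long-format frame: one catpri value per (unique_id, start) pair
df = pd.DataFrame({
    'unique_id': ['000001_01', '000001_01', '000001_02'],
    'start': pd.to_datetime(['1900-01-01 04:00', '1900-01-01 04:10',
                             '1900-01-01 04:00']),
    'catpri': [1, 2, 5],
})

# One row per unique_id, one column per start timestamp,
# cells hold the catpri values
wide = df.pivot(index='unique_id', columns='start', values='catpri')
print(wide)
```

Timestamps with no row for a given unique_id come out as NaN, which is why the resampling step below matters.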

Including resampling (thanks to @FabienP) to fill in the missing intervals:

df.set_index('start').resample(rule='10T').ffill().reset_index().pivot(index='unique_id',columns='start', values='catpri')

start   1900-01-01 00:50:00 1900-01-01 01:00:00 1900-01-01 01:10:00 1900-01-01 01:20:00 1900-01-01 01:30:00 1900-01-01 01:40:00 1900-01-01 01:50:00 1900-01-01 02:00:00 1900-01-01 02:10:00 1900-01-01 02:20:00 ... 1900-01-01 20:30:00 1900-01-01 20:40:00 1900-01-01 20:50:00 1900-01-01 21:00:00 1900-01-01 21:10:00 1900-01-01 21:20:00 1900-01-01 21:30:00 1900-01-01 21:40:00 1900-01-01 21:50:00 1900-01-01 22:00:00
unique_id                                                                                   
000001_01   13  13  13  13  13  13  13  13  13  13  ... 4   4   7   7   7   8   8   8   10  10
1 rows × 128 columns
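
One caveat, not shown in the one-liner above: with many unique_ids in one frame, resampling the whole frame at once would let one id's rows forward-fill into another id's gaps. A hedged sketch of a per-id resample using groupby (toy data and two made-up ids for illustration; `'10min'` is the same 10-minute rule as `'10T'`):

```python
import pandas as pd

# Toy frame with two ids and gaps in the 10-minute grid
df = pd.DataFrame({
    'unique_id': ['a', 'a', 'b', 'b'],
    'start': pd.to_datetime(['1900-01-01 04:00', '1900-01-01 04:30',
                             '1900-01-01 04:00', '1900-01-01 04:20']),
    'catpri': [1, 2, 3, 4],
})

# Resample each id separately so one id's values never fill
# another id's missing intervals, then unstack to the wide shape
wide = (df.set_index('start')
          .groupby('unique_id')['catpri']
          .resample('10min')
          .ffill()
          .unstack('start'))
print(wide)
```

Each row now covers only the timestamps observed for that id; timestamps outside an id's own range stay NaN rather than inheriting another id's values.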
G. Anderson
  • You can try [`resample`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.resample.html) then `ffill` to get `catpri` for all time in 10min interval. – FabienP Apr 17 '19 at 19:45
  • You're right, I didn't even notice the missing rows in the time series. – G. Anderson Apr 17 '19 at 20:53