
I have a dataframe something like this:

Timestamp            Value  Type
2021-07-03 15:12:00  1      2.a
2021-07-03 16:11:00  1      2.a
2021-07-04 14:25:00  1      2.a
2021-07-04 15:50:00  1      2.a
2021-07-04 17:07:00  2      2.c
2021-07-04 18:06:00  2      2.c
2021-07-04 20:14:00  3      2.a
2021-07-05 00:00:00  3      2.a

import pandas as pd

df = pd.DataFrame({'Timestamp': ['2021-07-03 15:12:00',
  '2021-07-03 16:11:00',
  '2021-07-04 14:25:00',
  '2021-07-04 15:50:00',
  '2021-07-04 17:07:00',
  '2021-07-04 18:06:00',
  '2021-07-04 20:14:00',
  '2021-07-05 00:00:00'],
 'Value': [1, 1, 1, 1, 2, 2, 3, 3],
 'Type': ['2.a', '2.a', '2.a', '2.a', '2.c', '2.c', '2.a', '2.a']})

Where Timestamp is a DatetimeIndex, Value is some integer, and Type is some string. Every two rows represent a start/stop time pair, e.g. 2021-07-03 15:12:00 is the start of some occurrence and 2021-07-03 16:11:00 is the end of the same occurrence. Value and Type are identical within each pair but vary between pairs; nothing about a given Value and Type combination makes it unique.

I'm looking for a way to downsample or squash each two rows into a single row so I'd end up with something like:

Start                End                  Value  Type
2021-07-03 15:12:00  2021-07-03 16:11:00  1      2.a
2021-07-04 14:25:00  2021-07-04 15:50:00  1      2.a
2021-07-04 17:07:00  2021-07-04 18:06:00  2      2.c
2021-07-04 20:14:00  2021-07-05 00:00:00  3      2.a

Where Start and End would remain pandas Timestamps, and for Value and Type one of each pair's rows can be tossed/removed/ignored while the other is used as the value.

Is there any pandas way of doing this or should I be looping through rows myself to do something like this?

The closest post I've found is here, but doesn't that solution assume the data set has a unique Name and Month combination, since it uses groupby? In my dataset there is nothing unique about the Value and Type combination.

I also considered aggregate but couldn't manage to figure out how to do this particular operation.

topher217
  • How do you differentiate different start-end pairs with the same date if the value and type are the same? – devReddit Jul 21 '21 at 10:19
  • Simply by their order (i.e. the first two are one pair, the next two are another, etc.). If this data were to be resorted in any way, such information would be lost. – topher217 Jul 21 '21 at 10:22

3 Answers

3

Given a CSV with...

Timestamp,Value,Type
2021-07-03 15:12:00,1,2.a
2021-07-03 16:11:00,1,2.a
2021-07-04 14:25:00,1,2.a
2021-07-04 15:50:00,1,2.a
2021-07-04 17:07:00,2,2.c
2021-07-04 18:06:00,2,2.c
2021-07-04 20:14:00,3,2.a
2021-07-05 00:00:00,3,2.a

Try:

import pandas as pd

# read csv from file
df = pd.read_csv('test_csv.csv')

# change column 'Timestamp' to 'Start'
df.rename(columns={'Timestamp':'Start'}, inplace=True)

# set column 'Start' to datetime
df['Start'] = df['Start'].astype('datetime64[ns]')

# create a new column called 'End' and populate with the values of 'Start' but shifted "up one row"
df.insert(1, 'End', df['Start'].shift(-1))

# delete every other row
df = df.iloc[::2, :]

# output df to check
print(df)

Outputs:

    Start                   End                     Value   Type
0   2021-07-03 15:12:00     2021-07-03 16:11:00     1       2.a
2   2021-07-04 14:25:00     2021-07-04 15:50:00     1       2.a
4   2021-07-04 17:07:00     2021-07-04 18:06:00     2       2.c
6   2021-07-04 20:14:00     2021-07-05 00:00:00     3       2.a
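
If the frame is already in memory (as in the question's pd.DataFrame example) rather than in a CSV, the same shift-and-slice idea should work; a minimal sketch, assuming the question's column names:

import pandas as pd

df = pd.DataFrame({'Timestamp': ['2021-07-03 15:12:00', '2021-07-03 16:11:00',
                                 '2021-07-04 14:25:00', '2021-07-04 15:50:00',
                                 '2021-07-04 17:07:00', '2021-07-04 18:06:00',
                                 '2021-07-04 20:14:00', '2021-07-05 00:00:00'],
                   'Value': [1, 1, 1, 1, 2, 2, 3, 3],
                   'Type': ['2.a', '2.a', '2.a', '2.a', '2.c', '2.c', '2.a', '2.a']})

# make sure the column holds real Timestamps, then shift the next row's time up as 'End'
df['Timestamp'] = pd.to_datetime(df['Timestamp'])
df = df.rename(columns={'Timestamp': 'Start'})
df.insert(1, 'End', df['Start'].shift(-1))

# keep only the even (start) rows and reindex
df = df.iloc[::2].reset_index(drop=True)
print(df)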
MDR
  • [Shift](https://pandas.pydata.org/docs/reference/api/pandas.Series.shift.html?highlight=pandas%20shift#pandas.Series.shift)! Yes, this is a key method I was missing. – topher217 Jul 21 '21 at 10:45
  • One issue with this is your csv imports the "Timestamp" column as a string, whereas I have a column of pd.Timestamps. When I tried using this on my df only the date was inserted into End after being shifted. Not sure why this is, but I can just convert to string then use this. – topher217 Jul 21 '21 at 11:02
  • Np. I've updated with an extra line of `df['Start'] = df['Start'].astype('datetime64[ns]')` – MDR Jul 21 '21 at 13:37
1
  • You can merge() odd and even rows
  • Columns can be renamed / dropped using rename() and drop() (a sketch of this follows the output below)
df = pd.DataFrame({'Timestamp': ['2021-07-03 15:12:00',
  '2021-07-03 16:11:00',
  '2021-07-04 14:25:00',
  '2021-07-04 15:50:00',
  '2021-07-04 17:07:00',
  '2021-07-04 18:06:00',
  '2021-07-04 20:14:00',
  '2021-07-05 00:00:00'],
 'Value': [1, 1, 1, 1, 2, 2, 3, 3],
 'Type': ['2.a', '2.a', '2.a', '2.a', '2.c', '2.c', '2.a', '2.a']})


df.loc[(df.index % 2) == 0].assign(jid=lambda d: d.index // 2).merge(
    df.loc[(df.index % 2) == 1].assign(jid=lambda d: d.index // 2), on="jid"
)
           Timestamp_x  Value_x Type_x  jid          Timestamp_y  Value_y Type_y
0  2021-07-03 15:12:00        1    2.a    0  2021-07-03 16:11:00        1    2.a
1  2021-07-04 14:25:00        1    2.a    1  2021-07-04 15:50:00        1    2.a
2  2021-07-04 17:07:00        2    2.c    2  2021-07-04 18:06:00        2    2.c
3  2021-07-04 20:14:00        3    2.a    3  2021-07-05 00:00:00        3    2.a
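
To finish with the column names from the question, a sketch of the rename/drop step mentioned above (the result variable name is just for illustration):

result = (
    df.loc[(df.index % 2) == 0].assign(jid=lambda d: d.index // 2)
      .merge(df.loc[(df.index % 2) == 1].assign(jid=lambda d: d.index // 2), on="jid")
      .rename(columns={"Timestamp_x": "Start", "Timestamp_y": "End",
                       "Value_x": "Value", "Type_x": "Type"})
      .drop(columns=["jid", "Value_y", "Type_y"])
      [["Start", "End", "Value", "Type"]]
)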
Rob Raymond
  • Neat! Thanks! It looks like I need to assign the return value from your merge to df (or some other var), but otherwise looks to work. – topher217 Jul 21 '21 at 10:41
  • do you know any way of generating the code you have to create df given a csv file or table somewhere else? The best I could find to illustrate this on SO was just showing it in Markdown using something like [this](https://www.tablesgenerator.com/markdown_tables). Is there any automatic way to share a df generator like this if I didn't generate it like this in the first place? If not, thanks for doing the manual labor! I'll add it to my post for others sake. – topher217 Jul 21 '21 at 10:49
  • `print(df.to_markdown())` and paste into SO. I didn't type it in I copied table and used `pd.read_csv(io.StringIO("""...."""), sep="\t")` – Rob Raymond Jul 21 '21 at 13:23
  • Ah, I meant the generation of the `df = pd.DataFrame({..` code from an existing table. Given I have a csv file with data. What is the best way to generate the `df = pd.DataFrame( ...` code from this so other SO users can load it without downloading a csv file? I saw your original post with the tab parser, but now it appears it is comma separated. Just wondered if you had a method for generating this as it would seem useful when asking pandas questions here on SO. – topher217 Jul 21 '21 at 14:07
  • There are multiple ways to initialise a data frame; all you need is `df = pd.read_csv(...)`. The code I have provided does not depend on how the DF has been initialised – Rob Raymond Jul 21 '21 at 14:10
  • Let me start over. In your answer, and in many good answers/questions related to pandas on Stack Overflow, people include the initialization of a df as pure code as opposed to a markdown table, or csv file. Although these are also valid options, the pure code initialization allows users to copy and paste working code without downloading anything. On the flip side, many people are not generating their dataframes from pure code (they have some csv file they read from). Therefore I thought it would be nice if there existed a tool to generate the pure code initialization text from csv or the like. – topher217 Jul 21 '21 at 14:22
  • Got it - to post a good question, including the data so that no more work is required is best practice. The way I generated the above was `df.to_dict("list")` and pasting the **dict** into `pd.DataFrame()`. I do answer a number of questions; questions put together in this way are more likely to attract answers as there is no work required by the answerer to get to a starting point ... if your DF is big, `df.head(10).to_dict("list")` will often provide a decent starting point – Rob Raymond Jul 21 '21 at 14:43
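
For reference, a rough sketch of the workflow described in that last comment (the CSV filename is hypothetical):

import pandas as pd

# read the real data, then print a copy-pasteable dict for the question
df = pd.read_csv('my_data.csv')          # hypothetical filename
print(df.head(10).to_dict("list"))       # paste the printed dict into pd.DataFrame({...})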
0

Let's set up the initial problem:

import pandas as pd
import numpy as np

df = pd.DataFrame({'Timestamp': [
'2021-07-03 15:12:00',
'2021-07-03 16:11:00',
'2021-07-04 14:25:00',
'2021-07-04 15:50:00',
'2021-07-04 17:07:00',
'2021-07-04 18:06:00',
'2021-07-04 20:14:00',
'2021-07-05 00:00:00'
],
'Value': [1, 1, 1, 1, 2, 2, 3, 3],
'Type': ['2.a', '2.a', '2.a', '2.a', '2.c', '2.c', '2.a', '2.a']}
)

I'd add an additional column called "Extremity" whose values alternate between "start" and "end", and a column indicating the trial number.

df.loc[:,'Extremity'] = ''
df.loc[1::2,'Extremity'] = 'end'
df.loc[::2,'Extremity'] = 'start'
df.loc[:,'Trial'] = [i//2 + 1 for i in range(len(df))]

Let's look at it now.

df
             Timestamp  Value Type Extremity  Trial
0  2021-07-03 15:12:00      1  2.a     start      1
1  2021-07-03 16:11:00      1  2.a       end      1
2  2021-07-04 14:25:00      1  2.a     start      2
3  2021-07-04 15:50:00      1  2.a       end      2
4  2021-07-04 17:07:00      2  2.c     start      3
5  2021-07-04 18:06:00      2  2.c       end      3
6  2021-07-04 20:14:00      3  2.a     start      4
7  2021-07-05 00:00:00      3  2.a       end      4

Now we're going to get the times corresponding to the same trial into the same row by pivoting:

pivoted_df = df.pivot(index='Trial',columns='Extremity',values=['Timestamp','Value','Type'])
pivoted_df
                     Timestamp                      Value       Type      
Extremity                  end                start   end start  end start
Trial                                                                     
1          2021-07-03 16:11:00  2021-07-03 15:12:00     1     1  2.a   2.a
2          2021-07-04 15:50:00  2021-07-04 14:25:00     1     1  2.a   2.a
3          2021-07-04 18:06:00  2021-07-04 17:07:00     2     2  2.c   2.c
4          2021-07-05 00:00:00  2021-07-04 20:14:00     3     3  2.a   2.a

After sanity checking that the (Value, start) column equals the (Value, end) column and that the (Type, start) column equals the (Type, end) column, we can eliminate the duplicates and give the columns nicer names.
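
That sanity check itself can be done with Series.equals; a small sketch:

# confirm the start/end copies of Value and Type really are identical before dropping one side
assert pivoted_df[('Value', 'start')].equals(pivoted_df[('Value', 'end')])
assert pivoted_df[('Type', 'start')].equals(pivoted_df[('Type', 'end')])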

pivoted_df = pivoted_df.drop([('Value','start'),('Type','start')], axis='columns')
pivoted_df.columns = ['end_time','start_time','value','type']
pivoted_df
                end_time             start_time value type
Trial                                                     
1      2021-07-03 16:11:00  2021-07-03 15:12:00     1  2.a
2      2021-07-04 15:50:00  2021-07-04 14:25:00     1  2.a
3      2021-07-04 18:06:00  2021-07-04 17:07:00     2  2.c
4      2021-07-05 00:00:00  2021-07-04 20:14:00     3  2.a
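
One small caveat: since Timestamp was built from strings in this setup, the pivoted start_time/end_time columns are still strings; if real pandas Timestamps are needed (as the question asks), they can be converted afterwards, for example:

pivoted_df['start_time'] = pd.to_datetime(pivoted_df['start_time'])
pivoted_df['end_time'] = pd.to_datetime(pivoted_df['end_time'])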
Philip Egger