Transform pandas dataframe from tsv to parquet

Question

I'm trying to kind of pivot a pandas dataframe, but with a "twist" I guess? So heres the input table (tsv) that gets loaded into pandas.

timestamp   sensor      type    unit    value
1607724078  sensor_a    string  gps     coords1
1607724078  sensor_b    int     bar     1
1607724079  sensor_a    string  gps     coords5
1607724079  sensor_b    int     bar     4
1607724080  sensor_a    string  gps     coords9
1607724080  sensor_b    int     bar     7

The output should look like this

timestamp   sensor_a sensor_a_type sensor_a_unit sensor_b sensor_b_type sensor_b_unit
1607724078  coords1  string        gps           1        int           bar
1607724079  coords5  string        gps           4        int           bar
1607724080  coords9  string        gps           7        int           bar

So as you see, it should be grouped by timestamp and the value of the sensor column has to be defined as a separate column with the value from the original value column. Furthermore the original type and unit columns have to be "name-merged" with the new sensor_x column and keep their values.

The goal is to save it as parquet. Important: there can be more than two sensors.

For now I accomplished to create a MultiIndexed dataframe that includes all the information I need but I can't write it to parquet because I guess the MultiIndex has to be flattend somehow, so that the sensor unit and type have their own separate column.

import pandas as pd

if __name__ == '__main__':
    df = pd.read_csv("data/test.tsv", delimiter='\t', index_col='timestamp')
    df2 = df.pivot(columns=['sensor', 'unit', 'type'], values='value')
    print(df2)

Output

sensor     sensor_a sensor_b sensor_c
unit            gps      bar  percent
type         string      int   double
timestamp                            
1607724078  coords1        1    11.11
1607724079  coords5        4    44.44
1607724080  coords9        7    77.77

Thanks in advance!

score 0 · Answer 1 · answered Dec 12 '20 at 01:29

Here is a partial answer... just got to rename some columns:

from io import StringIO
sList = '''timestamp   sensor      type    unit    value
1607724078  sensor_a    string  gps     coords1
1607724078  sensor_b    int     bar     1
1607724079  sensor_a    string  gps     coords5
1607724079  sensor_b    int     bar     4
1607724080  sensor_a    string  gps     coords9
1607724080  sensor_b    int     bar     7'''.split('\n')
s = '\n'.join([','.join(l.split()) for l in sList])

with StringIO(s) as sio:
    df = pd.read_csv(sio)

df[df.sensor == 'sensor_a'].merge(df[df.sensor == 'sensor_b'], on='timestamp', suffixes=['_a','_b'])

Output

    timestamp   sensor_a    type_a  unit_a  value_a sensor_b    type_b  unit_b  value_b
0   1607724078  sensor_a    string  gps coords1 sensor_b    int bar 1
1   1607724079  sensor_a    string  gps coords5 sensor_b    int bar 4
2   1607724080  sensor_a    string  gps coords9 sensor_b    int bar 7

score 0 · Answer 2 · answered Dec 12 '20 at 02:46

You could replace the columns with just the top level per this solution

df2.columns = df2.columns.get_level_values(0)

which should result in something like

            sensor_a sensor_b sensor_c
1607724078  coords1        1    11.11
1607724079  coords5        4    44.44
1607724080  coords9        7    77.77

score 0 · Answer 3 · answered Dec 12 '20 at 03:09

You can reshape the dataframe using set_index and unstack, then flatten the multiindex using list comprehension like this:

import pandas as pd    
from io import StringIO

#Input file
sList = '''timestamp   sensor      type    unit    value
1607724078  sensor_a    string  gps     coords1
1607724078  sensor_b    int     bar     1
1607724079  sensor_a    string  gps     coords5
1607724079  sensor_b    int     bar     4
1607724080  sensor_a    string  gps     coords9
1607724080  sensor_b    int     bar     7'''

#Read file in as dataframe
df = pd.read_csv(StringIO(sList), sep='\s\s+', engine='python')

#Reshape the dataframe
df_out = df.set_index(['timestamp', 'sensor']).unstack()

#Flatten multiindex
df_out.columns = [f'{j}_{i}' if i != 'value' else f'{j}' for i, j in  df_out.columns]

df_out = df_out.reset_index()
print(df_out)

Output:

    timestamp sensor_a_type sensor_b_type sensor_a_unit sensor_b_unit sensor_a sensor_b
0  1607724078        string           int           gps           bar  coords1        1
1  1607724079        string           int           gps           bar  coords5        4
2  1607724080        string           int           gps           bar  coords9        7

Transform pandas dataframe from tsv to parquet

3 Answers3

Output