
I have the following situation: a CSV file may have up to 300 columns, and some of them (about 50) are list-valued, space-separated, and of uneven length, including cells with no values at all.

    Time               COL1                        COL2
2020-03-13 10:43:00.500  0.0 10.0 20.0 30.0 40.0   50.0 60.0 70.0 80.0 90.0
2020-03-13 10:43:00.900  10.0 20.0 30.0 70.0       10.0 20.0
2020-03-13 10:44:00.100  10.0 20.0 30.0 70.0   

I want to:

 1. Explode the columns into rows based on the highest frequency. E.g. in the 1st row the max frequency (number of space-separated values) is 5, and in the 2nd row it is 4.

Time                  COL1      COL2
2020-03-13 10:43:00.500   0.0    50.0
2020-03-13 10:43:00.580   10.0   60.0
2020-03-13 10:43:00.660   20.0   70.0
2020-03-13 10:43:00.740   30.0   80.0
2020-03-13 10:43:00.820   40.0   90.0    
2020-03-13 10:43:00.900   10.0   10.0 
2020-03-13 10:43:01.150   20.0   20.0 
2020-03-13 10:43:01.400   30.0 
2020-03-13 10:43:01.650   70.0 
2020-03-13 10:44:00.100   10.0 
2020-03-13 10:44:00.350   20.0 
2020-03-13 10:44:00.600   30.0
2020-03-13 10:44:00.850   70.0
 2. The Time field should be interpolated between two subsequent row times, something like this:

    curr_row_time = data_frame['Time'][ind1]
    next_row_time = data_frame['Time'][ind1 + 1]
    timestamp1 = datetime.datetime.strptime(str(curr_row_time), '%Y-%m-%d %H:%M:%S.%f')
    timestamp2 = datetime.datetime.strptime(str(next_row_time), '%Y-%m-%d %H:%M:%S.%f')
    time_delta = (timestamp2 - timestamp1).total_seconds() * 1000
    time_step_increment = time_delta / max_frequency  # should be the max for that row

For the last row, where there is no next row time, the max frequency can be taken as either an average or the last used max frequency. I'd appreciate any suggestion for point 1; point 2 is second priority, but it would be good to have as well.
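Put together, the idea above can be sketched as a small helper (illustrative only; `interpolated_times` is a name I'm assuming here, and `max_frequency` is taken to be the number of values in the current row):

```python
import datetime

def interpolated_times(curr_row_time, next_row_time, max_frequency):
    # parse the two consecutive row timestamps
    t1 = datetime.datetime.strptime(str(curr_row_time), '%Y-%m-%d %H:%M:%S.%f')
    t2 = datetime.datetime.strptime(str(next_row_time), '%Y-%m-%d %H:%M:%S.%f')
    # total gap in milliseconds, split evenly by the row's max frequency
    time_delta = (t2 - t1).total_seconds() * 1000
    step = time_delta / max_frequency
    return [t1 + datetime.timedelta(milliseconds=i * step)
            for i in range(max_frequency)]

times = interpolated_times('2020-03-13 10:43:00.500',
                           '2020-03-13 10:43:00.900', 5)
# -> 10:43:00.500, .580, .660, .740, .820, as in the expected output above
```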

For point 1, I tried the solution from Pandas Explode on Multiple columns after converting the space-separated values to lists (10.0 20.0 30.0 70.0 ----> [10.0, 20.0, 30.0, 70.0]), but it fails:

 File "<stdin>", line 3, in <listcomp>
  File "<__array_function__ internals>", line 6, in concatenate
ValueError: zero-dimensional arrays cannot be concatenated
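One common cause of that error is that empty cells stay as NaN scalars while the filled cells become lists; making sure every cell really is a list before exploding avoids the zero-dimensional arrays. A minimal sketch with a reduced sample (column names from the question, data abbreviated):

```python
import pandas as pd

df = pd.DataFrame({
    'Time': ['2020-03-13 10:43:00.500', '2020-03-13 10:44:00.100'],
    'COL1': ['0.0 10.0 20.0 30.0 40.0', '10.0 20.0 30.0 70.0'],
    'COL2': ['50.0 60.0 70.0 80.0 90.0', None],  # uneven / missing values
})

for col in ['COL1', 'COL2']:
    # NaN/None cells become empty lists, so every cell is list-typed
    df[col] = df[col].apply(lambda s: [] if pd.isna(s) else str(s).split())

print(df['COL2'].tolist())
# -> [['50.0', '60.0', '70.0', '80.0', '90.0'], []]
```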

The link below works for equal-size list columns: Split (explode) pandas dataframe string entry to separate rows.

Vikram Ranabhatt

1 Answer


First, you can create the exploded columns with concat, str.split, and stack. Use reset_index and join to get the associated 'Time' column back. Then you need to change the values in the 'Time' column to create the interpolation. I'm not sure one can interpolate a datetime column directly, but you can cast it to int64, mask the values that equal the previous row's with shift, and interpolate. So, like this:

import pandas as pd

l_col = ['COL1', 'COL2']

# split each space-separated column into positional sub-columns, stack the
# positions into rows, then re-attach 'Time' via the original index
df_f = pd.concat([df[col].str.split(' ', expand=True) for col in l_col],
                 axis=1, keys=l_col)\
         .stack()\
         .reset_index(level=1, drop=True)\
         .join(df[['Time']])\
         .reset_index(drop=True)

# cast to int64 (nanoseconds), mask timestamps repeated from the previous
# row, interpolate linearly, and convert back to datetime
df_f['Time'] = pd.to_datetime(df_f['Time'].astype('int64')
                                          .mask(df_f.Time.eq(df_f.Time.shift()))
                                          .interpolate(method='linear'))

print (df_f)
    COL1  COL2                    Time
0    0.0  50.0 2020-03-13 10:43:00.500
1   10.0  60.0 2020-03-13 10:43:00.580
2   20.0  70.0 2020-03-13 10:43:00.660
3   30.0  80.0 2020-03-13 10:43:00.740
4   40.0  90.0 2020-03-13 10:43:00.820
5   10.0  10.0 2020-03-13 10:43:00.900
6   20.0  20.0 2020-03-13 10:43:15.700
7   30.0  None 2020-03-13 10:43:30.500
8   70.0  None 2020-03-13 10:43:45.300
9   10.0       2020-03-13 10:44:00.100
10  20.0  None 2020-03-13 10:44:00.100
11  30.0  None 2020-03-13 10:44:00.100
12  70.0  None 2020-03-13 10:44:00.100

I'm not sure what you want for the missing values in COL2, so you may need some fillna to work that out.
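For example (an assumption about the desired result), the empty strings and None values left in COL2 could be normalised to NaN so the column becomes numeric:

```python
import numpy as np
import pandas as pd

# reduced stand-in for the exploded frame above
df_f = pd.DataFrame({'COL1': ['10.0', '20.0', '30.0'],
                     'COL2': ['10.0', '', None]})

# turn empty strings into NaN, then make the column float
# (None is converted to NaN by the float cast itself)
df_f['COL2'] = df_f['COL2'].replace('', np.nan).astype(float)

print(df_f['COL2'].tolist())  # -> [10.0, nan, nan]
```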

Ben.T