
I have a dataframe. I would like to extract features based on a time window.

df = pd.DataFrame({'time':[1,2,3,4,5,6,7,8,9,10,2,3,5,6,8,10,12],
                   'id':[793,793,793,793,793,793,793,793,793,793,942,942,942,942,942,942,942],
                   'B1':[10,20,30,40,50,60,70,80,90,100,23,24,25,27,30,44,55],
                   'B2':[10,20,30,40,50,60,70,80,90,100,23,24,25,27,30,44,55],
                   'B3':[10,20,30,40,50,60,70,80,90,100,23,24,25,27,30,44,55]})
time_window = pd.DataFrame({'time':[2,4,6,8,5,8], 'id':[793,793,793,793,942,942]})

Here, my time windows are:

        [2,4]--> for participant 793
        [6,8]--> for participant 793
        [5,8]--> for participant 942
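The pairing described above (every two consecutive rows of `time_window` form one start/end window for one participant) can be made explicit with a small reshaping step. This is only a sketch: the `windows` frame and its `start`/`end` column names are hypothetical, and it assumes the rows always come in consecutive pairs that share an `id`.

```python
import pandas as pd

time_window = pd.DataFrame({'time': [2, 4, 6, 8, 5, 8],
                            'id': [793, 793, 793, 793, 942, 942]})

# Assumption: even-positioned rows are window starts, odd-positioned rows are ends.
starts = time_window.iloc[0::2].reset_index(drop=True)
ends = time_window.iloc[1::2].reset_index(drop=True)

# One row per window: participant id, start time, end time.
windows = pd.DataFrame({'id': starts['id'],
                        'start': starts['time'],
                        'end': ends['time']})
print(windows)
```

With one row per window, it is easier to check that the windows match the list above before any feature extraction is attempted.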

My goal is to extract the features within the specified time windows for each participant. To do that, I wrote this function:

from tsfresh import extract_features

def apply_tsfresh(col):
  for i in range(len(time)):
    col.loc[time_window[i]:time_window[i+1]] = extract_features(col.loc[time_window[i]:time_window[i+1]], column_id="id")
    return col 

extracted_freatures = df.set_index('time').apply(apply_tsfresh)

It is supposed to extract the features within the specified time window for each participant. However, instead of results I get an error.

Could you please help me here? I am completely out of ideas.

My desired output should look like this: (image: desired result)

*There may be more than two extracted features, and their values may differ. This is only an example.

1 Answer


Initially, an empty dataframe 'extracted_freatures_' is created. A loop then steps through the 'time' column of 'time_window' two rows at a time, taking each pair of values as the bounds of one window. The results from 'extract_features' are appended to the 'extracted_freatures_' dataframe. Don't ask me how 'tsfresh' works, I don't know.

extracted_freatures_ = pd.DataFrame()

df = df.set_index('time')

for i in range(0, len(time_window['time']), 2):
    ind1 = time_window.loc[i, 'time']
    ind2 = time_window.loc[i+1, 'time']
    a = extract_features(df.loc[[ind1, ind2]], column_id="id")
    extracted_freatures_ = pd.concat([extracted_freatures_, a])

print(extracted_freatures_)

Output

Feature Extraction: 100%|██████████| 6/6 [00:00<00:00, 36.71it/s]
Feature Extraction: 100%|██████████| 6/6 [00:00<00:00, 39.50it/s]
Feature Extraction: 100%|██████████| 6/6 [00:00<00:00, 40.81it/s]
     B2__variance_larger_than_standard_deviation  ...  B3__mean_n_absolute_max__number_of_maxima_7
793                                          1.0  ...                                          NaN
942                                          0.0  ...                                          NaN
793                                          1.0  ...                                          NaN
942                                          1.0  ...                                          NaN
793                                          1.0  ...                                          NaN
942                                          1.0  ...                                          NaN

[6 rows x 2367 columns]
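To see why the output above has a row for both 793 and 942 in every window, it can help to print what `df.loc[[ind1, ind2]]` selects. The sketch below uses the first window (2 and 4) and, for brevity, only the `B1` column; `selected` is a hypothetical name. Because `time` is a non-unique index here, a list of labels returns every row carrying one of those labels, regardless of participant, and it returns only the two endpoint times rather than the range between them.

```python
import pandas as pd

# Same data as in the question, trimmed to one value column for brevity.
df = pd.DataFrame({'time': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 2, 3, 5, 6, 8, 10, 12],
                   'id': [793] * 10 + [942] * 7,
                   'B1': [10, 20, 30, 40, 50, 60, 70, 80, 90, 100, 23, 24, 25, 27, 30, 44, 55]})
df = df.set_index('time')

# A list inside .loc picks exactly these labels, wherever they occur in the index.
selected = df.loc[[2, 4]]
print(selected)

# Label 2 occurs for both participants, label 4 only for 793: three rows in total.
print(len(selected))
```

Time 3 is missing from the selection, and participant 942 appears even though the first window belongs to 793.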
inquirer
  • Can you tell me why you wrote `df.loc[[ind1,ind2]]` in `a = extract_features(df.loc[[ind1, ind2]], column_id="id")`? What does the double bracket actually do? – pythonhater Jun 06 '22 at 14:37
  • I updated the question with my desired output. Please have a look at it. – pythonhater Jun 06 '22 at 14:56
  • That is fancy indexing; you can read about it [here](https://jakevdp.github.io/PythonDataScienceHandbook/02.07-fancy-indexing.html). If you print df.loc[[ind1, ind2]], you will see that it returns more than two rows. extract_features returns two rows at each iteration. – inquirer Jun 06 '22 at 16:18
  • @pythonhater you have duplicate index values in 'df', so the rows after index 10 will also be used. Do you only need the rows up to index 10? – inquirer Jun 06 '22 at 17:04
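One way around the duplicate index values raised in the comments is to leave `time` as an ordinary column and select each window with a boolean mask on both `id` and `time`. The following is a minimal sketch rather than a tested tsfresh recipe: the `window` label column and the helper `extract_window_features` are hypothetical, the window endpoints are assumed to be inclusive, and `time_window` is assumed to hold consecutive start/end pairs per participant.

```python
import pandas as pd

df = pd.DataFrame({'time': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 2, 3, 5, 6, 8, 10, 12],
                   'id': [793] * 10 + [942] * 7,
                   'B1': [10, 20, 30, 40, 50, 60, 70, 80, 90, 100, 23, 24, 25, 27, 30, 44, 55],
                   'B2': [10, 20, 30, 40, 50, 60, 70, 80, 90, 100, 23, 24, 25, 27, 30, 44, 55],
                   'B3': [10, 20, 30, 40, 50, 60, 70, 80, 90, 100, 23, 24, 25, 27, 30, 44, 55]})
time_window = pd.DataFrame({'time': [2, 4, 6, 8, 5, 8],
                            'id': [793, 793, 793, 793, 942, 942]})

pieces = []
for k in range(0, len(time_window), 2):
    pid = time_window.loc[k, 'id']
    start = time_window.loc[k, 'time']
    end = time_window.loc[k + 1, 'time']
    # Keep only this participant's rows whose time lies in [start, end].
    mask = (df['id'] == pid) & df['time'].between(start, end)
    piece = df.loc[mask].copy()
    # Hypothetical label so that each window becomes its own series.
    piece['window'] = f"{pid}_{start}_{end}"
    pieces.append(piece)

windowed = pd.concat(pieces, ignore_index=True)
print(windowed.groupby('window').size())


def extract_window_features(frame):
    """Run tsfresh once over all windows (requires tsfresh to be installed)."""
    from tsfresh import extract_features
    # 'window' identifies each series; 'id' is dropped so it is not treated as a value.
    return extract_features(frame.drop(columns='id'),
                            column_id='window', column_sort='time')
```

Calling `extract_window_features(windowed)` should then give one feature row per window instead of one per participant. With only three samples per window, many of the tsfresh features are likely to come out as NaN.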