
I would like to use tsfresh to extract features from time series, but I am already having trouble with a very basic example. I generate 100 synthetic time series, each of length 100, simulating the function f(x)=x^2 with some noise.

According to this comment, that amount of data should be enough to extract and select relevant features.

The code I'm using is the following:

import numpy as np
import pandas as pd

from tsfresh import extract_features, select_features
from tsfresh.utilities.dataframe_functions import impute
from tsfresh.feature_extraction import ComprehensiveFCParameters


def sq(x):
    return x ** 2


def err():
    return np.random.normal() / 10


def load_data(n, length, seed=None):
    if seed is not None:
        np.random.seed(seed)

    x = list(range(length))

    df = []
    y = []
    for i in range(n):
        for x_0 in x:
            value = sq(x_0) * (1 + err())
            df.append([i, x_0, value])
        y.append(sq(length) * (1 + err()))

    df = pd.DataFrame(df, columns=['id', 'time', 'value'])
    y = pd.Series(y)

    return df, y


if __name__ == '__main__':
    # Create mock data
    df, y = load_data(100, 100, seed=0)

    print('Shape of df:', df.shape)
    print('Shape of y', y.shape)

    # Feature extraction
    X = extract_features(df, column_id='id', column_sort='time', column_value='value', default_fc_parameters=ComprehensiveFCParameters(), impute_function=impute)
    print('Shape of df after feature extraction: ', X.shape)

    # Feature selection
    X_filtered = select_features(X, y)
    print('Shape of df after feature selection: ', X_filtered.shape)

And the output it generates is the following:

Shape of df: (10000, 3)
Shape of y (100,)
Feature Extraction: 100%|██████████| 25/25 [00:01<00:00, 14.23it/s]
Shape of df after feature extraction:  (100, 789)
Shape of df after feature selection:  (100, 0)

Following this suggestion, I used the calculate_relevance_table function, which select_features calls to compute the p-value of each feature. I get a wide range of p-values, yet every feature is tagged as not relevant. This is the code:

from tsfresh.feature_selection.relevance import calculate_relevance_table

rt = calculate_relevance_table(X, y)

pd.set_option('display.width', 320)
pd.set_option('display.max_rows', 1000)
pd.set_option('display.max_columns', 10)
print(rt)

And this is the output (edited for readability):

                                                                                              feature      type   p_value  relevant
feature                                                                                                                            
value__fft_coefficient__attr_"angle"__coeff_49         value__fft_coefficient__attr_"angle"__coeff_49      real  0.009571     False
value__fft_coefficient__attr_"imag"__coeff_17           value__fft_coefficient__attr_"imag"__coeff_17      real  0.011753     False
value__fft_coefficient__attr_"real"__coeff_42           value__fft_coefficient__attr_"real"__coeff_42      real  0.020832     False
value__fft_coefficient__attr_"abs"__coeff_18             value__fft_coefficient__attr_"abs"__coeff_18      real  0.021163     False
value__fft_coefficient__attr_"angle"__coeff_16         value__fft_coefficient__attr_"angle"__coeff_16      real  0.022534     False
value__fft_coefficient__attr_"real"__coeff_18           value__fft_coefficient__attr_"real"__coeff_18      real  0.023249     False
value__fft_coefficient__attr_"angle"__coeff_17         value__fft_coefficient__attr_"angle"__coeff_17      real  0.024357     False
value__ratio_beyond_r_sigma__r_3                                     value__ratio_beyond_r_sigma__r_3    binary  0.031515     False
value__fft_coefficient__attr_"real"__coeff_16           value__fft_coefficient__attr_"real"__coeff_16      real  0.035505     False
value__lempel_ziv_complexity__bins_5                             value__lempel_ziv_complexity__bins_5      real  0.037937     False
value__spkt_welch_density__coeff_2                                 value__spkt_welch_density__coeff_2      real  0.039317     False
value__fft_coefficient__attr_"imag"__coeff_18           value__fft_coefficient__attr_"imag"__coeff_18      real  0.043470     False
value__fft_coefficient__attr_"angle"__coeff_10         value__fft_coefficient__attr_"angle"__coeff_10      real  0.047320     False
value__fft_coefficient__attr_"abs"__coeff_38             value__fft_coefficient__attr_"abs"__coeff_38      real  0.050042     False
value__fft_coefficient__attr_"abs"__coeff_24             value__fft_coefficient__attr_"abs"__coeff_24      real  0.051452     False
value__index_mass_quantile__q_0.6                                   value__index_mass_quantile__q_0.6      real  0.055947     False

[...]

value__fft_coefficient__attr_"abs"__coeff_13             value__fft_coefficient__attr_"abs"__coeff_13      real  0.947761     False
value__change_quantiles__f_agg_"var"__isabs_Tru...  value__change_quantiles__f_agg_"var"__isabs_Tr...      real  0.952504     False
value__fft_coefficient__attr_"angle"__coeff_25         value__fft_coefficient__attr_"angle"__coeff_25      real  0.957249     False
value__autocorrelation__lag_5                                           value__autocorrelation__lag_5      real  0.957249     False
value__first_location_of_maximum                                     value__first_location_of_maximum      real  0.958843     False
value__last_location_of_maximum                                       value__last_location_of_maximum      real  0.958843     False
value__number_peaks__n_3                                                     value__number_peaks__n_3      real  0.961078     False
value__fft_coefficient__attr_"angle"__coeff_20         value__fft_coefficient__attr_"angle"__coeff_20      real  0.961995     False
value__energy_ratio_by_chunks__num_segments_10_...  value__energy_ratio_by_chunks__num_segments_10...      real  0.961995     False
value__agg_linear_trend__attr_"rvalue"__chunk_l...  value__agg_linear_trend__attr_"rvalue"__chunk_...      real  0.961995     False
value__energy_ratio_by_chunks__num_segments_10_...  value__energy_ratio_by_chunks__num_segments_10...      real  0.961995     False
value__number_cwt_peaks__n_1                                             value__number_cwt_peaks__n_1      real  0.966065     False
value__agg_linear_trend__attr_"rvalue"__chunk_l...  value__agg_linear_trend__attr_"rvalue"__chunk_...      real  0.966743     False
value__fft_coefficient__attr_"abs"__coeff_41             value__fft_coefficient__attr_"abs"__coeff_41      real  0.966743     False
value__fft_coefficient__attr_"real"__coeff_11           value__fft_coefficient__attr_"real"__coeff_11      real  0.966743     False
value__fft_coefficient__attr_"abs"__coeff_25             value__fft_coefficient__attr_"abs"__coeff_25      real  0.971492     False
value__agg_linear_trend__attr_"rvalue"__chunk_l...  value__agg_linear_trend__attr_"rvalue"__chunk_...      real  0.971492     False
value__fft_coefficient__attr_"abs"__coeff_47             value__fft_coefficient__attr_"abs"__coeff_47      real  0.976242     False
value__agg_linear_trend__attr_"stderr"__chunk_l...  value__agg_linear_trend__attr_"stderr"__chunk_...      real  0.976242     False
value__change_quantiles__f_agg_"var"__isabs_Fal...  value__change_quantiles__f_agg_"var"__isabs_Fa...      real  0.980992     False
value__fft_coefficient__attr_"imag"__coeff_13           value__fft_coefficient__attr_"imag"__coeff_13      real  0.980992     False
value__change_quantiles__f_agg_"var"__isabs_Tru...  value__change_quantiles__f_agg_"var"__isabs_Tr...      real  0.985744     False
value__quantile__q_0.7                                                         value__quantile__q_0.7      real  0.985744     False
value__sample_entropy                                                           value__sample_entropy      real  0.988119     False
value__fft_coefficient__attr_"angle"__coeff_5           value__fft_coefficient__attr_"angle"__coeff_5      real  0.990495     False
value__cwt_coefficients__coeff_11__w_2__widths_...  value__cwt_coefficients__coeff_11__w_2__widths...      real  0.995248     False
value__fft_coefficient__attr_"abs"__coeff_37             value__fft_coefficient__attr_"abs"__coeff_37      real  0.995248     False
value__fft_coefficient__attr_"imag"__coeff_39           value__fft_coefficient__attr_"imag"__coeff_39      real  0.995248     False
value__fft_coefficient__attr_"real"__coeff_34           value__fft_coefficient__attr_"real"__coeff_34      real  0.995248     False
value__change_quantiles__f_agg_"var"__isabs_Tru...  value__change_quantiles__f_agg_"var"__isabs_Tr...      real  0.995248     False
value__c3__lag_1                                                                     value__c3__lag_1      real  1.000000     False
value__variance_larger_than_standard_deviation         value__variance_larger_than_standard_deviation  constant       NaN     False
value__has_duplicate_max                                                     value__has_duplicate_max  constant       NaN     False
value__has_duplicate_min                                                     value__has_duplicate_min  constant       NaN     False
value__has_duplicate                                                             value__has_duplicate  constant       NaN     False
value__length                                                                           value__length  constant       NaN     False

[...]

value__fourier_entropy__bins_2                                         value__fourier_entropy__bins_2  constant       NaN     False
value__fourier_entropy__bins_3                                         value__fourier_entropy__bins_3  constant       NaN     False
value__fourier_entropy__bins_5                                         value__fourier_entropy__bins_5  constant       NaN     False
value__fourier_entropy__bins_10                                       value__fourier_entropy__bins_10  constant       NaN     False
value__query_similarity_count__query_None__thre...  value__query_similarity_count__query_None__thr...  constant       NaN     False

Shouldn't the features with low p-values be considered relevant? Quoting from the calculate_relevance_table documentation:

“p_value” (the significance of this feature as a p-value, lower means more significant)

I understand that relevance is determined through a Benjamini–Hochberg procedure. However, I tried computing the relevance table with different values of fdr_level, ranging from 0 to 1, and never got any relevant feature.
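
For reference, the Benjamini–Hochberg step-up rule can be sketched in a few lines. This is a minimal illustration written for this question, not tsfresh's own implementation: the function name benjamini_hochberg is hypothetical, the p-value list is padded to 700 entries with made-up values, and NaN p-values are assumed to have been dropped beforehand. It suggests that with several hundred tests, the smallest p-value has to fall below roughly fdr_level / m before anything is declared relevant:

```python
import numpy as np


def benjamini_hochberg(p_values, fdr_level=0.05):
    """Return a boolean mask of rejected hypotheses (hypothetical helper)."""
    p = np.asarray(p_values, dtype=float)
    m = p.size
    order = np.argsort(p)                              # indices of p-values, ascending
    thresholds = fdr_level * np.arange(1, m + 1) / m   # k/m * fdr_level for rank k
    below = p[order] <= thresholds
    rejected = np.zeros(m, dtype=bool)
    if below.any():
        k = np.nonzero(below)[0].max()                 # largest rank passing its threshold
        rejected[order[:k + 1]] = True                 # reject everything up to that rank
    return rejected


# Three smallest p-values from the table above, padded with illustrative values
# (the 697 entries of 0.5 are an assumption, not data from the output).
p_values = [0.009571, 0.011753, 0.020832] + [0.5] * 697
relevant = benjamini_hochberg(p_values, fdr_level=0.05)
print(relevant.sum())  # 0
```

Under these assumptions the threshold for the smallest p-value is 0.05 / 700, about 7e-5, so a p-value of 0.0096 does not pass.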

Am I missing something? Shouldn't a time series mimicking the function f(x)=x^2 have relevant features (e.g. constant second derivative)?
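
To make the "constant second derivative" remark concrete, here is a minimal sketch on the noiseless signal only. It uses the second finite difference as a discrete stand-in for the second derivative, which is my own choice for illustration and not necessarily a feature tsfresh computes:

```python
import numpy as np

x = np.arange(100)
second_diff = np.diff(x ** 2, n=2)  # second finite difference of x^2
print(np.unique(second_diff))       # [2]
```

Whether this property survives the multiplicative noise in my mock data, and whether any extracted feature reflects it, is part of what I am asking.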

Thanks in advance!
