I would like to use tsfresh to extract features from time series, but I am already having trouble with a very basic example. I generate 100 synthetic time series, each of length 100, simulating the function f(x)=x^2 with some noise.
According to this comment, that amount of data should be enough to extract and select relevant features.
The code I'm using is the following:
import numpy as np
import pandas as pd
from tsfresh import extract_features, select_features
from tsfresh.utilities.dataframe_functions import impute
from tsfresh.feature_extraction import ComprehensiveFCParameters


def sq(x):
    return x ** 2


def err():
    # Gaussian noise with standard deviation 0.1
    return np.random.normal() / 10


def load_data(n, length, seed=None):
    if seed is not None:
        np.random.seed(seed)
    x = list(range(length))
    df = []
    y = []
    for i in range(n):
        # One noisy copy of f(x) = x^2 per series id
        for x_0 in x:
            value = sq(x_0) * (1 + err())
            df.append([i, x_0, value])
        # One target value per series
        y.append(sq(length) * (1 + err()))
    df = pd.DataFrame(df, columns=['id', 'time', 'value'])
    y = pd.Series(y)
    return df, y


if __name__ == '__main__':
    # Create mock data
    df, y = load_data(100, 100, seed=0)
    print('Shape of df:', df.shape)
    print('Shape of y', y.shape)

    # Feature extraction
    X = extract_features(df, column_id='id', column_sort='time', column_value='value',
                         default_fc_parameters=ComprehensiveFCParameters(),
                         impute_function=impute)
    print('Shape of df after feature extraction: ', X.shape)

    # Feature selection
    X_filtered = select_features(X, y)
    print('Shape of df after feature selection: ', X_filtered.shape)
And the output it generates is the following:
Shape of df: (10000, 3)
Shape of y (100,)
Feature Extraction: 100%|██████████| 25/25 [00:01<00:00, 14.23it/s]
Shape of df after feature extraction: (100, 789)
Shape of df after feature selection: (100, 0)
Following this suggestion, I used the calculate_relevance_table function, which select_features calls internally, to compute the p-value of each feature. I get a wide range of p-values, but every feature is tagged as not relevant. This is the code:
from tsfresh.feature_selection.relevance import calculate_relevance_table

rt = calculate_relevance_table(X, y)
pd.set_option('display.width', 320)
pd.set_option('display.max_rows', 1000)
pd.set_option('display.max_columns', 10)
print(rt)
And this is the output (edited for readability):
feature type p_value relevant
feature
value__fft_coefficient__attr_"angle"__coeff_49 value__fft_coefficient__attr_"angle"__coeff_49 real 0.009571 False
value__fft_coefficient__attr_"imag"__coeff_17 value__fft_coefficient__attr_"imag"__coeff_17 real 0.011753 False
value__fft_coefficient__attr_"real"__coeff_42 value__fft_coefficient__attr_"real"__coeff_42 real 0.020832 False
value__fft_coefficient__attr_"abs"__coeff_18 value__fft_coefficient__attr_"abs"__coeff_18 real 0.021163 False
value__fft_coefficient__attr_"angle"__coeff_16 value__fft_coefficient__attr_"angle"__coeff_16 real 0.022534 False
value__fft_coefficient__attr_"real"__coeff_18 value__fft_coefficient__attr_"real"__coeff_18 real 0.023249 False
value__fft_coefficient__attr_"angle"__coeff_17 value__fft_coefficient__attr_"angle"__coeff_17 real 0.024357 False
value__ratio_beyond_r_sigma__r_3 value__ratio_beyond_r_sigma__r_3 binary 0.031515 False
value__fft_coefficient__attr_"real"__coeff_16 value__fft_coefficient__attr_"real"__coeff_16 real 0.035505 False
value__lempel_ziv_complexity__bins_5 value__lempel_ziv_complexity__bins_5 real 0.037937 False
value__spkt_welch_density__coeff_2 value__spkt_welch_density__coeff_2 real 0.039317 False
value__fft_coefficient__attr_"imag"__coeff_18 value__fft_coefficient__attr_"imag"__coeff_18 real 0.043470 False
value__fft_coefficient__attr_"angle"__coeff_10 value__fft_coefficient__attr_"angle"__coeff_10 real 0.047320 False
value__fft_coefficient__attr_"abs"__coeff_38 value__fft_coefficient__attr_"abs"__coeff_38 real 0.050042 False
value__fft_coefficient__attr_"abs"__coeff_24 value__fft_coefficient__attr_"abs"__coeff_24 real 0.051452 False
value__index_mass_quantile__q_0.6 value__index_mass_quantile__q_0.6 real 0.055947 False
[...]
value__fft_coefficient__attr_"abs"__coeff_13 value__fft_coefficient__attr_"abs"__coeff_13 real 0.947761 False
value__change_quantiles__f_agg_"var"__isabs_Tru... value__change_quantiles__f_agg_"var"__isabs_Tr... real 0.952504 False
value__fft_coefficient__attr_"angle"__coeff_25 value__fft_coefficient__attr_"angle"__coeff_25 real 0.957249 False
value__autocorrelation__lag_5 value__autocorrelation__lag_5 real 0.957249 False
value__first_location_of_maximum value__first_location_of_maximum real 0.958843 False
value__last_location_of_maximum value__last_location_of_maximum real 0.958843 False
value__number_peaks__n_3 value__number_peaks__n_3 real 0.961078 False
value__fft_coefficient__attr_"angle"__coeff_20 value__fft_coefficient__attr_"angle"__coeff_20 real 0.961995 False
value__energy_ratio_by_chunks__num_segments_10_... value__energy_ratio_by_chunks__num_segments_10... real 0.961995 False
value__agg_linear_trend__attr_"rvalue"__chunk_l... value__agg_linear_trend__attr_"rvalue"__chunk_... real 0.961995 False
value__energy_ratio_by_chunks__num_segments_10_... value__energy_ratio_by_chunks__num_segments_10... real 0.961995 False
value__number_cwt_peaks__n_1 value__number_cwt_peaks__n_1 real 0.966065 False
value__agg_linear_trend__attr_"rvalue"__chunk_l... value__agg_linear_trend__attr_"rvalue"__chunk_... real 0.966743 False
value__fft_coefficient__attr_"abs"__coeff_41 value__fft_coefficient__attr_"abs"__coeff_41 real 0.966743 False
value__fft_coefficient__attr_"real"__coeff_11 value__fft_coefficient__attr_"real"__coeff_11 real 0.966743 False
value__fft_coefficient__attr_"abs"__coeff_25 value__fft_coefficient__attr_"abs"__coeff_25 real 0.971492 False
value__agg_linear_trend__attr_"rvalue"__chunk_l... value__agg_linear_trend__attr_"rvalue"__chunk_... real 0.971492 False
value__fft_coefficient__attr_"abs"__coeff_47 value__fft_coefficient__attr_"abs"__coeff_47 real 0.976242 False
value__agg_linear_trend__attr_"stderr"__chunk_l... value__agg_linear_trend__attr_"stderr"__chunk_... real 0.976242 False
value__change_quantiles__f_agg_"var"__isabs_Fal... value__change_quantiles__f_agg_"var"__isabs_Fa... real 0.980992 False
value__fft_coefficient__attr_"imag"__coeff_13 value__fft_coefficient__attr_"imag"__coeff_13 real 0.980992 False
value__change_quantiles__f_agg_"var"__isabs_Tru... value__change_quantiles__f_agg_"var"__isabs_Tr... real 0.985744 False
value__quantile__q_0.7 value__quantile__q_0.7 real 0.985744 False
value__sample_entropy value__sample_entropy real 0.988119 False
value__fft_coefficient__attr_"angle"__coeff_5 value__fft_coefficient__attr_"angle"__coeff_5 real 0.990495 False
value__cwt_coefficients__coeff_11__w_2__widths_... value__cwt_coefficients__coeff_11__w_2__widths... real 0.995248 False
value__fft_coefficient__attr_"abs"__coeff_37 value__fft_coefficient__attr_"abs"__coeff_37 real 0.995248 False
value__fft_coefficient__attr_"imag"__coeff_39 value__fft_coefficient__attr_"imag"__coeff_39 real 0.995248 False
value__fft_coefficient__attr_"real"__coeff_34 value__fft_coefficient__attr_"real"__coeff_34 real 0.995248 False
value__change_quantiles__f_agg_"var"__isabs_Tru... value__change_quantiles__f_agg_"var"__isabs_Tr... real 0.995248 False
value__c3__lag_1 value__c3__lag_1 real 1.000000 False
value__variance_larger_than_standard_deviation value__variance_larger_than_standard_deviation constant NaN False
value__has_duplicate_max value__has_duplicate_max constant NaN False
value__has_duplicate_min value__has_duplicate_min constant NaN False
value__has_duplicate value__has_duplicate constant NaN False
value__length value__length constant NaN False
[...]
value__fourier_entropy__bins_2 value__fourier_entropy__bins_2 constant NaN False
value__fourier_entropy__bins_3 value__fourier_entropy__bins_3 constant NaN False
value__fourier_entropy__bins_5 value__fourier_entropy__bins_5 constant NaN False
value__fourier_entropy__bins_10 value__fourier_entropy__bins_10 constant NaN False
value__query_similarity_count__query_None__thre... value__query_similarity_count__query_None__thr... constant NaN False
Shouldn't the features with low p-values be considered relevant? Quoting the calculate_relevance_table documentation:
“p_value” (the significance of this feature as a p-value, lower means more significant)
I understand that relevance is determined through a Benjamini–Hochberg procedure. However, I tried computing the relevance table with different values of fdr_level, ranging from 0 to 1, and never got any relevant feature.
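For reference, here is a minimal pure-Python sketch of the textbook Benjamini–Hochberg step-up rule, applied to the 16 smallest p-values listed in the table above. The function name benjamini_hochberg is hypothetical, and this is not tsfresh's own implementation, which may apply a different or stricter correction. It only illustrates that a p-value below 0.05 is not automatically "relevant" once many features are tested at the same time.

```python
def benjamini_hochberg(p_values, fdr_level=0.05):
    """Textbook BH step-up rule (a sketch, not tsfresh's code).

    Returns a list of booleans, True where the hypothesis is rejected,
    i.e. where the feature would count as relevant.
    """
    m = len(p_values)
    # Indices of the p-values in ascending order
    order = sorted(range(m), key=lambda i: p_values[i])
    # Largest 1-based rank k such that p_(k) <= k / m * fdr_level
    largest_k = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank / m * fdr_level:
            largest_k = rank
    rejected = [False] * m
    for idx in order[:largest_k]:
        rejected[idx] = True
    return rejected


# Toy input: only the two smallest p-values pass their rank thresholds
toy = [0.001, 0.008, 0.039, 0.041, 0.6]
toy_result = benjamini_hochberg(toy, fdr_level=0.05)

# The 16 smallest p-values from the relevance table above
smallest = [0.009571, 0.011753, 0.020832, 0.021163, 0.022534, 0.023249,
            0.024357, 0.031515, 0.035505, 0.037937, 0.039317, 0.043470,
            0.047320, 0.050042, 0.051452, 0.055947]
# Assumption for illustration: pretend these 16 were the only features tested
table_result = benjamini_hochberg(smallest, fdr_level=0.05)

print(toy_result)
print(any(table_result))
```

In this sketch, none of the 16 listed p-values falls below its rank threshold even when they are treated as the only hypotheses. The threshold at each rank shrinks further as the number of tested features grows.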
Am I missing something? Shouldn't a time series mimicking the function f(x)=x^2 have relevant features (e.g. a constant second derivative)?
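To make the "constant second derivative" remark concrete, here is a small sketch that computes the discrete second difference of the noise-free series x^2 on the same grid, range(100). The helper second_difference is a hypothetical name introduced only for illustration. The sketch says nothing about whether any tsfresh feature captures this property, and the multiplicative noise in load_data would break the exact constancy.

```python
def second_difference(values):
    """Discrete second difference: v[i+2] - 2*v[i+1] + v[i]."""
    return [values[i + 2] - 2 * values[i + 1] + values[i]
            for i in range(len(values) - 2)]


# Noise-free version of the series used above: f(x) = x^2 on x = 0..99
clean = [x ** 2 for x in range(100)]
d2 = second_difference(clean)

# (x+2)^2 - 2(x+1)^2 + x^2 = 2 for every x, so the set has a single element
print(set(d2))
```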
Thanks in advance!