
What I am trying to do is detect whether a dataset is a time series or not. I want to automate this process.

Let's say I have the datasets below:

df1:

Heading 1  Heading 2  Heading 3  Heading 4
1/1/2023   34         12         34
2/1/2023   42         99         42
3/1/2023   42         99         42
4/1/2023   42         99         42

df2:

Heading 1  Heading 2  Heading 3  Heading 4
1/1/2023   34         12         34
3/1/2023   42         99         42
4/1/2023   42         99         42
7/1/2023   42         99         42

df3:

Heading 1  Heading 2  Heading 3  Heading 4
Jan 2023   34         12         34
Feb 2023   42         99         42
Mar 2023   42         99         42

df4:

Heading 1  Heading 2  Heading 3  Heading 4
2020       34         12         34
2021       42         99         42
2022       42         99         42

df1 has a time column which is evenly spaced; df2 has a time column which is not evenly spaced; and df3 and df4 have time columns which are not in a datetime format.

Out of the above dataframes, which ones are time series data and which are not? What exactly are the criteria for a dataset to be considered a time series?

Thanks!

lowkey
  • You can define a time series as any series that has a timestamp index. No one says the timestamps have to be evenly spaced, or have to include the day of the month, or even months. If it has any term that can be interpreted as a time or part of a time, it is a time series. You can, however, choose whether a time series matches your own particular requirement of timestamp format – Galo do Leste Jan 13 '23 at 04:58
  • @GalodoLeste So out of the above 4 `df`, all of them are time series data, right, as all of them have a timestamp? – lowkey Jan 13 '23 at 05:10
  • Technically, yes. May I ask the reason for the question? Is it just a philosophical question, or is some programming library trying to impose a format on you? – Galo do Leste Jan 13 '23 at 05:14
  • I would also add a caveat that, ideally, every timestamp in the index be unique, otherwise, it could get tricky to work with. But this is getting into debatable territory. – Galo do Leste Jan 13 '23 at 05:18
  • @GalodoLeste I want to automate the process of detecting whether a dataset is a time series or not for some web application that I am building. Of course I can take the input from the user, but I want to minimise the user interaction as much as possible. – lowkey Jan 13 '23 at 05:25
  • OK! I would suggest searching the datasets index for any non-date or non-date_convertible (the index could be timestamps in string format) data. – Galo do Leste Jan 13 '23 at 05:29
  • @GalodoLeste But it's not necessary that the time column will be present as the index. It can be present at any position in the `df`. So how would I know then? – lowkey Jan 13 '23 at 05:50
  • The time data can appear as a column rather than in the index but really should be the index. If it is not, you probably should make it the index as that will make manipulating the dataframe for your needs much easier. – Galo do Leste Jan 13 '23 at 05:54
  • @GalodoLeste Yes, I know that, but how would I search for the time column in the dataframe? As shown in the 4 `df`, the time column can be `datetime`, `str`, or `int`. There is no fixed format. – lowkey Jan 13 '23 at 05:58
  • That could be problematic given the innumerable string formats timestamps can take. I would suggest looking at https://stackoverflow.com/questions/25341945/check-if-string-has-date-any-format, then cycling through the columns to test for format compliance. If possible, it would be best if you can somehow impose a format on the data such that the timestamps appear in the index. Otherwise it could take a while to run – Galo do Leste Jan 13 '23 at 06:07

2 Answers


As @GalodoLeste indicates, your dataframes are time series:

import pandas as pd

df1['Heading 1'] = pd.to_datetime(df1['Heading 1'], dayfirst=True)
df2['Heading 1'] = pd.to_datetime(df2['Heading 1'], dayfirst=True)
df3['Heading 1'] = pd.to_datetime(df3['Heading 1'])
df4['Heading 1'] = pd.to_datetime(df4['Heading 1'], format='%Y')

but three of them have an inferable frequency and one does not:

>>> pd.infer_freq(df1['Heading 1'])
'D'

>>> pd.infer_freq(df2['Heading 1'])
None

>>> pd.infer_freq(df3['Heading 1'])
'MS'

>>> pd.infer_freq(df4['Heading 1'])
'AS-JAN'
Corralien
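If the goal is to automate this distinction, the frequency check above can be wrapped in a small helper. This is only a sketch (the name `infer_series_freq` is made up here); note that `pd.infer_freq` raises a `ValueError` when given fewer than three timestamps, which the helper treats as "no regular frequency":

```python
import pandas as pd

def infer_series_freq(s):
    """Return the inferred frequency string of a datetime column, or None.

    pd.infer_freq needs at least 3 timestamps and raises ValueError
    otherwise; we treat that case as having no regular frequency.
    """
    try:
        return pd.infer_freq(pd.DatetimeIndex(s))
    except ValueError:
        return None

regular = pd.Series(pd.to_datetime(
    ['1/1/2023', '2/1/2023', '3/1/2023', '4/1/2023'], dayfirst=True))
irregular = pd.Series(pd.to_datetime(
    ['1/1/2023', '3/1/2023', '4/1/2023', '7/1/2023'], dayfirst=True))

print(infer_series_freq(regular))    # 'D'
print(infer_series_freq(irregular))  # None
```

A column for which this returns a non-None string is an evenly spaced time series; None means the timestamps are valid but irregularly spaced.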

Let's assume this example:

  Heading 1  Heading 2  Heading 3  Heading 4  Heading 5 Heading 6 Heading 7
0  1/1/2023         34         12         34       2000  Jan 2023  1/1/2023
1  2/1/2023         42         99         42       2001  Feb 2023       NaN
2  3/1/2023         42         99         42       2002  Mar 2023       NaN
3  4/1/2023         42         99         42       2003       NaN       NaN

You can try to convert with `to_datetime`, using the default automated datetime detection performed by pandas (which is very effective!).

import pandas as pd

def find_datelike_cols(df):
    # Coerce every column to string, try to parse it as datetime, and
    # keep the columns where at least one value parses successfully.
    return df.columns[df.astype(str).apply(pd.to_datetime, errors='coerce').notna().any()]

cols = find_datelike_cols(df)
print(cols)

Output:

Index(['Heading 1', 'Heading 5', 'Heading 6', 'Heading 7'], dtype='object')

You can also add a minimum number of matching rows as a threshold to decide that a column is datetime-like:

def find_datelike_cols(df, thresh=None):
    mask = df.astype(str).apply(pd.to_datetime, errors='coerce').notna()
    # With a threshold, require at least `thresh` parseable values per
    # column; otherwise a single parseable value is enough.
    return df.columns[mask.sum() >= thresh if thresh else mask.any()]

find_datelike_cols(df)
# Index(['Heading 1', 'Heading 5', 'Heading 6', 'Heading 7'], dtype='object')

find_datelike_cols(df, thresh=3)
# Index(['Heading 1', 'Heading 5', 'Heading 6'], dtype='object')
mozway
  • Instead of `mask.sum()`, you can choose `mask.mean()` and a threshold between 0 and 1? – Corralien Jan 13 '23 at 07:57
  • I reused your answer [here](https://stackoverflow.com/a/75106273/15239951) :-P – Corralien Jan 13 '23 at 07:59
  • Yes of course, I hesitated to add this but didn't want to overcomplicate – mozway Jan 13 '23 at 08:02
  • At least you have the same behavior as `dropna` :-) – Corralien Jan 13 '23 at 08:05
  • @mozway The first function `find_datelike_cols` is giving weird results. I had a sales column in my df which had float values. It should not be included in the result but `pd.to_datetime` is treating it as datelike – spectre Jan 14 '23 at 06:05
  • @spectre then you might need some tweaking. Are the potential date columns always string in the input? – mozway Jan 14 '23 at 06:22
  • @mozway what do you mean string in the input? I assume that since we are converting all the columns in the df to string, they would be! – spectre Jan 14 '23 at 12:22
  • @mozway I am facing a similar problem. The function `find_datelike_cols` treats float values as datelike – lowkey Jan 14 '23 at 12:30
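A possible tweak for the float issue raised in these comments (a sketch, not part of the answer itself): restrict the search to non-numeric columns before attempting conversion. The trade-off is that genuinely temporal integer columns, like the year column in df4, would then also be skipped:

```python
import pandas as pd

def find_datelike_cols(df, thresh=None):
    # Skip numeric columns entirely: depending on their values, numbers
    # can be coerced into timestamps even though they are not dates.
    candidates = df.select_dtypes(exclude='number')
    mask = candidates.astype(str).apply(pd.to_datetime, errors='coerce').notna()
    return candidates.columns[mask.sum() >= thresh if thresh else mask.any()]

df = pd.DataFrame({
    'Heading 1': ['1/1/2023', '2/1/2023', '3/1/2023'],
    'Sales': [10.5, 20.1, 30.7],  # float column, should not be flagged
})
print(list(find_datelike_cols(df)))  # ['Heading 1']
```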