1

I'm working on a project in pandas on python. I receive as input a .csv file like this:

Name   Timestamp       Data
A1       259           [1.1,1.0,0.1]
A1       260           [-0.1,1.2,0.3]
A1       261           [0.1,0.2,-0.3]
...
A1       14895         [1.4,0.3,1.8]
...      
A2       278           [-1.1,1.2,0.4]
A2       353           [-0.1,1.2,0.3]
A2       409           [-0.1,1.2,0.3]
...
A2       14900         [-0.1,1.2,0.3]
...
A1140    107           [-0.5,-1.0,-1.0]
A1140    107           [0.6,0.1,0.3]
A1140    114           [-1.1,-1.2,0.3] 
... 
A1140    14995         [-1,1.2,0.4]

I've 1140+ names and hundreds/thousands of data for each name. Data was recorded at 200 Hz and I think that the timestamp numbers indicates milliseconds, though I' m not sure, i don't have access to this information. I've to resample to 50 Hz frequency.

How can I do this? Do I need to convert Timestamp into actual seconds and then use the .resample() function with 0.25s? And should i use a .groupby["Name"] function? Thank you in advance!

liakoyras
  • 1,101
  • 12
  • 27
Kraton
  • 65
  • 1
  • 8
  • 1
    200 Hz means one sample every 5 miliseconds. Your data does not reflect that, there are samples in various timestamps with a difference not equal to 5 ms. – liakoyras Sep 08 '19 at 18:14
  • I'm not sure that are milliseconds, i edited, thanks. – Kraton Sep 08 '19 at 18:18
  • 1
    How did you acquire the 200 Hz info? The samplings seem to be very inconsistent, so stating that the sampling happened with a certain frequency can be confusing at least. – liakoyras Sep 08 '19 at 18:25
  • For example, in A1140 you have two measurements with the same timestamp. – liakoyras Sep 08 '19 at 18:30
  • 1
    Also, what is your goal? Do you want a file of the same format but in a different frequency, or (for example) a different file for each name? – liakoyras Sep 08 '19 at 18:36
  • I received the 200 Hz info from the guy who worked on the previous part of the project and gave me this.csv. The A1140 is not the only case of two measurements with the same timestamp, for example in A1 i've 4 measurements which correspond at the number 330 in timestamp For the goal yes, i want the same file in a different frequency (i suggested to use .groupby["Name"] because i want to try to have almost the same number of datas for each name) – Kraton Sep 08 '19 at 18:46
  • Possible duplicate of [Resample a dataframe with a column of lists](https://stackoverflow.com/questions/57872277/resample-a-dataframe-with-a-column-of-lists) – liakoyras Sep 12 '19 at 06:42

1 Answers1

1

I cannot answer the question exactly in its entirety as not even you are sure about the timestamp, but I will try to give you some general guidelines.
What you have here is called panel data, many different time series for each "name".
groupby(['Name']).apply(<func>) can indeed be a useful method, as it allows for manipulation of each of the different names separately, allowing you to work with the simpler data type of a time series.
A time series is data of the type:

Date                  Value
2000-01-01 00:00:00   3
2000-01-01 00:03:00   12
2000-01-01 00:06:00   21

As you can see, the time period in which each sample is taken, is 3 minutes. We could call resample() and convert it to 10 minutes like this:

series.resample('10T').mean()

Note that instead of mean you could use .apply(<func>) to choose the downsampling method. For more info on the frequency, consider this question.


To conclude, your best bet would be to try and find out what exactly is timestamp, convert it to a DateTime and then use either
df.groupby(['Name']).resample('20L').mean()

or with a for loop iterate through each name and use resample to each series individually.

liakoyras
  • 1,101
  • 12
  • 27
  • Thank you so much! In my case the "Timestamp" corresponds number of seconds (or milli/nano/micro seconds) from the start of the recorded activity. So hypothetically IF the numbers in my timestamp are expressed in milliseconds at 200hz first i've to: df ['Timestamp'] = pd.to_datetime(df['Timestamp'], unit='ms') (i hope it's correct) and then df.groupby(['Name']).resample('20L').mean() right? (the 20L is from 200 to 50hz in ms, right?) – Kraton Sep 08 '19 at 19:42
  • 1
    Yeah, this seems right, however only you have access to the data in order to test the code, so unexpected problems might occur. – liakoyras Sep 08 '19 at 19:44
  • 1
    Yes, i think that the datas are kinda wrong, i'll ask. Last question, for datas in nanoseconds, the conversion at 50hz will be .resample('20000000N'), right? – Kraton Sep 08 '19 at 19:55
  • 1
    You do not need to specify this. `.resample('20000000N')` should be exactly the same as `.resample('20L')` and pandas will do the rest. – liakoyras Sep 08 '19 at 19:57
  • Oh wow, so i only have to change the unit in pd.to_date(df['Timestamp'], unit = 'N') (because i see don't see the ns in the list in the other topic but there is N for nanoseconds) and then i can keep .resample('20L')? – Kraton Sep 08 '19 at 20:03
  • I would suggest looking at the documentation of to_date in order to find the right unit. – liakoyras Sep 08 '19 at 20:09
  • One last question (for real this time), after the to_date and the resample how can i convert back the "Timestamp" column to numbers? Will pandas do the job automatically when i save the changes with the df.to_csv? – Kraton Sep 08 '19 at 20:34
  • 1
    This depends on your output (which is difficult if not impossible to predict). Converting from a timestamp to a number (of milliseconds or nanoseconds or whatever) should not be difficult, you will easily find an answer if you search. – liakoyras Sep 08 '19 at 20:37