0

I have a dataset with a 1-second timestep in the form of a CSV with the following format:

2021-02-07 11:00:30,    64.8
2021-02-07 11:00:31,    64.8
2021-02-07 11:00:35,    50.3
..
..

and so on and so forth. The problem is, it only has entries for when the value at that time is larger than zero, all the values that are equal to zero are simply not recorded. So the dataset starts at the first nonzero value at 2021-02-07 11:00:30

I'm trying to create another, larger CSV that has the time from 2021-02-07 00:00:00 to 2021-02-07 23:59:59 (covering the entire 24-hour period, 1-second steps) and basically fill all the values that are missing from the original CSV with zeroes. Something like this:

..
..
2021-02-07 11:00:30,    64.8
2021-02-07 11:00:31,    64.8
2021-02-07 11:00:32,    0
2021-02-07 11:00:33,    0
2021-02-07 11:00:34,    0
2021-02-07 11:00:35,    50.3
..
..

I'm not sure how to implement that. Can anyone please advise? P.S: Please note that the values are completely missing, there are no rows for them. It's not a NaN situation.

mikasa
  • 1
  • 1
  • Does this answer your question? [Replacing blank values (white space) with NaN in pandas](https://stackoverflow.com/questions/13445241/replacing-blank-values-white-space-with-nan-in-pandas) – mightyandweakcoder Feb 25 '21 at 14:00

1 Answers1

0

If the dates in your df are not your index, then you have to set them as your index.

To set as index:

df = df.set_index('Datecolumn')

After that try this:

df.asfreq(freq='1S', fill_value=0)
rhug123
  • 7,893
  • 1
  • 9
  • 24
  • I'm getting this error after entering the second command. ValueError: cannot reindex from a duplicate axis – mikasa Feb 25 '21 at 20:31
  • does your index have duplicate values? – rhug123 Feb 25 '21 at 20:58
  • It would appear so. `print(df[df.index.duplicated()])` <- I ran this and got an output that's a little weird. The values don't look like duplicates to me. I guess I don't really understand what a duplicated index is or what I'm supposed to do to resolve the issue. This is a snippet of the output to that command https://imgur.com/a/h2eMcNH – mikasa Feb 25 '21 at 21:03
  • I believe by default `duplicated()` keeps the first occurrence. put this line of code in between the other two above and see if it works. `df = df.loc[~df.index.duplicated()]` – rhug123 Feb 25 '21 at 21:11
  • Okay, so duplicated indices are no longer an issue, but nothing has changed. Only the values in the middle of the CSV exist (and no missing values got filled out) but the rest of the 24 hour values aren't covered. https://imgur.com/a/0WADA5T – mikasa Feb 25 '21 at 21:23
  • Its hard to tell, but one issue could be that your index is not a datetime index. before you set your date column as an index, try doing this `df['datecol'] = pd.to_datetime(df['datecol'])`. – rhug123 Feb 25 '21 at 22:39
  • That didn't work either :/ sorry. I appreciate your help. – mikasa Feb 25 '21 at 22:51
  • Thats weird... i dont know why its not working... in my last comment i was saying to put it in between the two lines of code above. It actually needs to go first. So we want to convert to datetime, set that datetime as the index, then use `asfreq` to resample. It worked when I tested it. – rhug123 Feb 26 '21 at 14:19
  • Here is the link to `asfreq` https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.asfreq.html – rhug123 Feb 26 '21 at 14:19