0

I have a question regarding how to fill missint date values in a pandas dataframe. I found a similar question ( pandas fill missing dates in time series )

but this doesn't answer my actual question.

I have a dataframe looking something like this:

date            amount   person   country
01.01.2019      10       John     IT
01.03.2019      5        Jane     SWE
01.05.2019      3        Jim      SWE
01.05.2019      10       Jim      SWE
02.01.2019      10       Bob      UK
02.01.2019      10       Jane     SWE
02.03.2019      10       Sue      IT

As you can see, there are missing values in the dates. What I need to do is to fill the missing date-values and fill remaining column values with the values from the previous line, EXCEPT for the column 'amount', which I need to be a 0, otherwise I would falsify my amounts.

I know there is a command for that in Pandas ( https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.reindex.html ) but I'm not sure how to apply that to filling missing values.

data = data.reindex(pd.date_range("2019-01-01", "2019-01-03")) 

method='backfill') , fill_value="0") ?

The expected output would be as follows:

date            amount   person   country
01.01.2019      10       John     IT
01.02.2019      0        Jane     SWE
01.03.2019      5        Jane     SWE
01.04.2019      0        Jane     SWE
01.05.2019      3        Jim      SWE
01.05.2019      10       Jim      SWE
02.01.2019      10       Bob      UK
02.01.2019      10       Jane     SWE
02.02.2019      0        Jane     SWE
02.03.2019      10       Sue      IT

I would appreciate any help on that regard.

Thank you and BR

Melany
  • 31
  • 2

1 Answers1

0

I think simpliest is replace missing values in 2 steps:

data = data.reindex(pd.date_range("2014-11-01", "2019-10-31")) 
data['amount'] = data['amount'].fillna(0)
data = data.bfill()
jezrael
  • 822,522
  • 95
  • 1,334
  • 1,252
  • Thank you a lot for your fast reply. I tried that, and the result was partially satisfying. My 'amount' column is filled with 0's as expected, but the other columns are filled with 'NaN'. Also, for some reason - my dataframe got significantly (from 2+ million entries to 1826). Could the reason for that be, that I'm not actually using the date in my initial dataframe ? – Melany Oct 14 '20 at 11:20
  • @Melany - Is there `DatetimeIndex` in input `df` ? What is `print (df.index.max())` and `print (df.index.min())` ? – jezrael Oct 14 '20 at 11:26
  • I set the date as the index now, before it was just an ascending number. now I tried again and I'm getting the error "ValueError: cannot reindex from a duplicate axis" which seems to be a common issue. data.index.min(): 2014-01-11 00:00:00 data.index.max(): 2019-12-10 00:00:00 – Melany Oct 14 '20 at 11:42
  • @Melany - Error means there are duplicated dates, not like in sample data bu e.g. if change `01.02.2019` to `01.01.2019` like first date, how values are processing? – jezrael Oct 14 '20 at 11:44
  • Unfortunately this did not change anything. Is it possible to solve this with the reindex command? Because I can imagine there being issues with duplicate index values. – Melany Oct 14 '20 at 11:50
  • @Melany - I edit your sample data like in your real data (duplicated datetimes), how looks expected output? Can you add it to question by [edit](https://stackoverflow.com/posts/64350787/edit) ? – jezrael Oct 14 '20 at 11:53
  • I realised my initial post was a bit misleading, so I edited it to mimic my actual data more accurately. – Melany Oct 14 '20 at 12:06
  • @Melany - Can you add expected output? How looks data in final dataframe? – jezrael Oct 14 '20 at 12:16
  • I inserted the expected outlook in my initial post, thank you for noting that. – Melany Oct 14 '20 at 12:29
  • @Melany - Is output correct? I cannot see logic here - why some rows are expanded in january, another in february? There is not necessary grouping by `person, country` and expand datetimes? What date range is used in sample data - expected output? – jezrael Oct 14 '20 at 13:05