
Suppose I have a dataset that consists of the following columns:

  • Stock_id: the ID of a stock
  • Date: a date in 2018, e.g. 25/03/2018
  • Stock_value: the value of the stock on that specific date

Some dates, different for each stock, are entirely missing from the dataset, and I would like to fill them in.

By missing dates, I mean that there is no row at all for these dates; it is not that the rows exist and the Stock_value there is simply NA.

A limitation is that some stocks were introduced to the stock market at some point during 2018, so apparently I do not want to fill in dates for these stocks from before they existed.

By this I mean that if a stock was introduced to the stock market on 21/05/2018, then I want to fill in any missing dates for this stock from 21/05/2018 to 31/12/2018, but no dates before 21/05/2018.

What is the most efficient way to do this?

I have seen some posts on StackOverflow (post_1, post_2, etc.) but I think my case is more specialized, so I would like to see an efficient way to handle it.

Let me provide an example. I will limit it to two stocks and to the week from 01/01/2018 to 07/01/2018, otherwise it won't fit in here.

Let's suppose that I initially have the following:

Stock_id    Date    Stock_value
1   01/01/2018  124
1   02/01/2018  130
1   03/01/2018  136
1   05/01/2018  129
1   06/01/2018  131
1   07/01/2018  133
2   03/01/2018  144
2   04/01/2018  148
2   06/01/2018  150
2   07/01/2018  147

Thus for Stock_id = 1 the date 04/01/2018 is missing.

For Stock_id = 2 the date 05/01/2018 is missing, and since the dates for this stock start at 03/01/2018, the dates before that should not be filled in (because the stock was introduced to the stock market on 03/01/2018).

Hence, I would like to have the following as output:

Stock_id    Date    Stock_value
1   01/01/2018  124
1   02/01/2018  130
1   03/01/2018  136
1   04/01/2018  NA
1   05/01/2018  129
1   06/01/2018  131
1   07/01/2018  133
2   03/01/2018  144
2   04/01/2018  148
2   05/01/2018  NA
2   06/01/2018  150
2   07/01/2018  147
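To make the example concrete, here is a sketch in pandas of how the input above could be turned into the desired output. The column names follow the example; the `groupby`/`reindex` approach is one possibility, not the only one:

```python
import pandas as pd

# A sketch of the target transformation: parse the dates, then reindex
# each stock against its own full daily range so the missing days
# appear as NA.
df = pd.DataFrame({
    "Stock_id": [1, 1, 1, 1, 1, 1, 2, 2, 2, 2],
    "Date": ["01/01/2018", "02/01/2018", "03/01/2018", "05/01/2018",
             "06/01/2018", "07/01/2018", "03/01/2018", "04/01/2018",
             "06/01/2018", "07/01/2018"],
    "Stock_value": [124, 130, 136, 129, 131, 133, 144, 148, 150, 147],
})
df["Date"] = pd.to_datetime(df["Date"], dayfirst=True)

filled = (
    df.set_index("Date")
      .groupby("Stock_id")["Stock_value"]
      # Each group's range starts at its own first date, so a stock gets
      # no rows from before it was introduced to the market.
      .apply(lambda s: s.reindex(pd.date_range(s.index.min(), s.index.max())))
      .reset_index()
      .rename(columns={"level_1": "Date"})
)
```

This produces 12 rows: 7 for stock 1 and 5 for stock 2, with NA on 04/01/2018 and 05/01/2018 respectively.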
num3ri
Outcast
  • @jezrael, hm, from what I see, yes, this answers my question (even though it is a bit too dense to understand instantly). However, does your solution work if you also take into account the limitation described in my post above: "A limitation is that some stocks were introduced to the stock market at some point during 2018, so apparently I do not want to fill in dates for these stocks from before they existed."? I cannot really tell from reading your solution; I may just simply test it. – Outcast Feb 20 '19 at 12:06
  • @jezrael, yes, I tested it. It definitely does this incorrectly. It returns all the dates of 2018. Actually, since you are not creating any dates time series, it simply returns the dates which it sees in the dataset. So, for example, if the date 21/05/2018 is missing entirely from the dataset, it does not return it while it should (at least for the stocks existing after this date). – Outcast Feb 20 '19 at 12:22
  • (This is why I guess we should not rush to mark a post as duplicate here but anyways...) – Outcast Feb 20 '19 at 12:23
  • @jezrael, to start with, I am not dealing with stocks, so I will also give you some imaginary data. Secondly, there is nothing so surprising about the data; you can create a small sample by yourself according to the description given above. Of course I can create this sample to save you some time, since I am the one asking this question, but I just said this to clarify that in principle you can create a small sample of data by yourself. – Outcast Feb 20 '19 at 12:26
  • @jezrael, here is your example. Does this help? – Outcast Feb 20 '19 at 12:41
  • @jezrael, the problem is that I cannot copy-paste the values in any other way from my Excel to here. The example is easy and imaginary, so there is no need to copy the data; it is just useful for understanding what I am talking about. – Outcast Feb 20 '19 at 12:48
  • @jezrael, yes this is what I also thought. So something like this I guess above? – Outcast Feb 20 '19 at 12:50
  • @jezrael, by the way, please keep in mind what I said above about your solution at the other post applied to mine: "Actually, since you are not creating any dates timeseries then it simply returns the dates which it sees in the dataset. So for example if the date 21/05/2018 is missing entirely from the dataset then it does not return it while it should (at least for the stocks existing after this date)". Therefore, I guess that for each stock you have to create a date series with starting date the starting date of the stock and with ending date e.g. the 31/01/2018 to deal with the problem... – Outcast Feb 20 '19 at 12:57

2 Answers


Use asfreq per group; with large data, performance may be problematic. Make sure Date is a datetime first, e.g. `df['Date'] = pd.to_datetime(df['Date'], dayfirst=True)`:

df = (df.set_index('Date')
        .groupby('Stock_id')['Stock_value']
        .apply(lambda x: x.asfreq('D'))
        .reset_index()
        )
print(df)
    Stock_id       Date  Stock_value
0          1 2018-01-01        124.0
1          1 2018-01-02        130.0
2          1 2018-01-03        136.0
3          1 2018-01-04          NaN
4          1 2018-01-05        129.0
5          1 2018-01-06        131.0
6          1 2018-01-07        133.0
7          2 2018-01-03        144.0
8          2 2018-01-04        148.0
9          2 2018-01-05          NaN
10         2 2018-01-06        150.0
11         2 2018-01-07        147.0
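For intuition, here is what asfreq('D') does to a single datetime-indexed series (a minimal illustration with made-up values): it conforms the index to daily frequency between the first and last dates, inserting NaN for any missing day.

```python
import pandas as pd

# Made-up values: 2018-01-03 has no row, so asfreq('D') inserts it
# with NaN while keeping the existing values unchanged.
s = pd.Series(
    [10.0, 20.0, 40.0],
    index=pd.to_datetime(["2018-01-01", "2018-01-02", "2018-01-04"]),
)
daily = s.asfreq("D")
print(daily)
```

Note that asfreq only fills between a series' own first and last dates, which is why, per group, each stock's range starts at its own introduction date.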

EDIT:

If you want each group's range to run from the group's minimal datetime up to some fixed (scalar) maximum datetime, use reindex with date_range:

df = (df.set_index('Date')
        .groupby('Stock_id')['Stock_value']
        .apply(lambda x: x.reindex(pd.date_range(x.index.min(), '2019-02-20')))
        .reset_index()
        )
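As a self-contained sketch of the EDIT above (with a tiny made-up frame, and '2018-01-07' standing in for the real cutoff such as '2019-02-20'):

```python
import pandas as pd

df = pd.DataFrame({
    "Stock_id": [2, 2],
    "Date": ["03/01/2018", "05/01/2018"],
    "Stock_value": [144, 150],
})
# Parse day-first, matching the DD/MM/YYYY format used in the question.
df["Date"] = pd.to_datetime(df["Date"], dayfirst=True)

out = (df.set_index("Date")
         .groupby("Stock_id")["Stock_value"]
         .apply(lambda x: x.reindex(pd.date_range(x.index.min(), "2018-01-07")))
         .reset_index())
# Stock 2 now spans 03/01-07/01; 04/01, 06/01 and 07/01 are NaN.
```

Each group starts at its own minimum date but all groups extend to the same scalar end date, which addresses the "stock still live but dates end early" case from the comments.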
jezrael
  • The dates are good now, but as I said, the 'Stock_value' column has NAs everywhere, and I think that some stocks are missing entirely. – Outcast Feb 20 '19 at 14:00
  • hmmm, so it seems like a data-related problem. – jezrael Feb 20 '19 at 14:00
  • Maybe, but the original dataset is loaded properly and everything shows up alright, so I am not sure what kind of data problem this would be. By the way, two stocks are missing, and from what I see these have quite a small number of observations in the original dataset, so I do not know how and why this may play a role (perhaps in relation to x.asfreq()). – Outcast Feb 20 '19 at 14:06
  • Stock_id float64, Date object, Stock_value float64. – Outcast Feb 20 '19 at 14:08
  • Yes, `Date` cannot be object, but datetimes. Do you try `df['Date'] = pd.to_datetime(df['Date'], dayfirst=True)` before my solution? – jezrael Feb 20 '19 at 14:09
  • Yes, this made it work. Now the values are showing properly. The only problem is that if, for some reason, the dates for a stock end at e.g. 03/12/2018, then the solution does not extend them up to e.g. 31/12/2018. In a few words, the limitation is to not generate dates for days when the stock did not exist, but it should generate dates up to today (since the stock is still live) or up to a specific date, e.g. 31/12/2018. But I guess that I am asking too much? In any case, thank you for your help thus far; you are very patient with me, and with people in general I guess. – Outcast Feb 20 '19 at 14:20
  • (Actually, I think I have another dataset with the full dates of each stock but with different columns, so perhaps I will go for a left/right join with the one above; of course, it is interesting for me to have a solution to my problem, or a good approximation of one, like yours.) – Outcast Feb 20 '19 at 14:22
  • I think it generates dates between the min and max datetime per group. – jezrael Feb 20 '19 at 14:31
  • Yes, exactly. I would like to generate dates from the min per group but up to a universally specified max, e.g. to the date 20/02/2019, if for some reason I do not have these dates. (But as I said, it is perhaps not necessary to solve this, because I have another dataset with dates to join with mine; also, I think that you would have to considerably revise your solution to include this too.) – Outcast Feb 20 '19 at 14:59
  • Cool no worries:). This is what I am getting for now: `.apply(lambda x: x.reindex(x.min(), '2019-02-20')) TypeError: reindex() takes from 1 to 2 positional arguments but 3 were given`. – Outcast Feb 20 '19 at 15:16
  • This goes back to 1970 for each group for some reason!! :) Unless I am missing something... – Outcast Feb 20 '19 at 15:40
  • yes, `.apply(lambda x: x.reindex(pd.date_range(x.index.min(), '2019-02-20')))`? – jezrael Feb 20 '19 at 15:49
  • I think that it works; you look like a pandas genius. Thanks ;) – Outcast Feb 20 '19 at 16:54
  • So you may add this to your answer I guess. – Outcast Feb 20 '19 at 16:54
  • By the way, it would be good to add to your answer what we discussed here in the comments, or at least make a note in your answer that the discussion in the comments contains some valuable information :) – Outcast Feb 26 '19 at 13:12

df = (df.set_index(['Date', 'Stock_id'])
        .unstack()
        .fillna(method='ffill')
        .stack()
        .reset_index())
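A caveat worth noting (my reading, sketched with made-up data; .ffill() is used below as the modern spelling of fillna(method='ffill')): this approach forward-fills gaps with the previous value instead of leaving NA, and it can only align stocks on dates that already appear somewhere in the dataset, so a date missing for every stock is still not generated.

```python
import pandas as pd

# Made-up data: stock 1 covers 01-04 Jan, stock 2 only 02 and 04 Jan.
df = pd.DataFrame({
    "Stock_id": [1, 1, 1, 1, 2, 2],
    "Date": pd.to_datetime(["2018-01-01", "2018-01-02", "2018-01-03",
                            "2018-01-04", "2018-01-02", "2018-01-04"]),
    "Stock_value": [124.0, 130.0, 136.0, 129.0, 148.0, 150.0],
})

out = (df.set_index(["Date", "Stock_id"])
         .unstack()          # dates become the index, one column per stock
         .ffill()            # fills gaps with the previous value, not NA
         .stack()            # back to long format
         .reset_index())

# Stock 2's gap on 2018-01-03 is filled with 148.0 (the 02 Jan value)
# rather than left as NA, which differs from the output the question asks for.
```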

heyu91