1

I want to check in a column if a certain month of data is missing. My column has all the dates for two years between 2016 and 2017 except June 2017. This column is a purchase date column where when a purchase is made the date is recorded. Only for the month of June 2017, it has not recorded any purchases so I want to do a missing value check and report this missing month in the EDA for my project. The column dtype is datetime64[ns]

I have used the following code but it shows that all the months are missing as the value returned is true. The output should yeild all the missing months from this column in the format MM-YYYY

df1.reindex(pd.period_range(df1.booking_date.min(),   
                            df1.booking_date.max(), 
                            freq= 'M')).isnull().all(1)

The sample data set looks like following:

booking_date
2016-01-16 00:00:00
2016-02-14 00:00:00
2016-03-01 00:00:00
.....
2017-05-01 00:00:00
2017-07-10 00:00:00

so the june month is missing and I want to extract that

Django0602
  • 797
  • 7
  • 26
  • Can you provide a sample of data that reproduces the issue? It's hard to understand without seeing an example. Thanks – cs95 Oct 26 '19 at 19:23
  • Please [provide a reproducible copy of the DataFrame with `to_clipboard`](https://stackoverflow.com/questions/52413246/provide-a-reproducible-copy-of-the-dataframe-with-to-clipboard/52413247#52413247) – Trenton McKinney Oct 26 '19 at 19:29
  • Update the data set for reference – Django0602 Oct 26 '19 at 19:32
  • How do you expect for this to be shown? What does your expected output look like? – Trenton McKinney Oct 26 '19 at 19:44
  • It should return June-2017 or 06-2017 – Django0602 Oct 26 '19 at 19:45
  • 1
    You have to write a function that makes a set of the years and months in the dataset and then looks for the difference between expected and actual. `months_present = df.booking_date.dt.month.tolist()`, `months = set(x for x in range(1, 13))`, `months.difference(set(months_present))` – Trenton McKinney Oct 26 '19 at 19:53
  • Perfect! It worked. now one more thing. If I would want to add back the missing date for example 2017-06-01 to the same column, how can I do that? – Django0602 Oct 26 '19 at 19:59
  • I'd create a dataframe with the new data and append it. – Trenton McKinney Oct 26 '19 at 20:03
  • Even I did that but I want to keep the format of the column as DateTime only but after the append the format of column changes to a string. How to prevent that? – Django0602 Oct 26 '19 at 20:05
  • When I have two dataframes with a datatime column and I append them, the result is also a datatime column. Therefore, make certain the columns are properly typed. Something like `df1['booking_date'] = pd.to_datetime(df1['booking_date'])` – Trenton McKinney Oct 26 '19 at 20:11

1 Answers1

1

To use reindex on a month periods the index must be of timestamp type, but your index is the default index (consecutive numbers).

To find "missing" months, take the following approach:

Set index to booking_date, then resample by months and compute size of each group:

s = df1.set_index('booking_date').resample('MS').size()

Then limit this result to elements (months) with value of 0 and take their indices (maybe as an "ordinary" list):

s[s == 0].index.tolist()

If you want these missing months just as periods (not starting dates of each month), run:

s[s == 0].index.to_period('M')
Valdi_Bo
  • 30,023
  • 4
  • 23
  • 41