What's the best way to find "missing" values in a dataframe?

Question

Say there's a dataframe:

import pandas as pd
df = pd.DataFrame([1,2,3,4,5, 7,8, 10])

I want to find the "missing" numbers in it (6 and 9). My code to do this is:

li = []
low = int(min(df.values))
high = int(max(df.values))

for i in range(low, high+1):
    if i not in df.values:
        li.append(i)

print(li)
>>> [6, 9]

But if the dataframe is huge, the for loop can take a long time. In my case, with a dataframe of ~300k rows, it's taking 162 seconds.

Is there a more efficient (vectorized?) way to do this?

Yeah, serially from 1 to a threshold number (we can call it `m`). — Kristada673, Jul 20 '18 at 08:14
@ user2285236 looks good to me, post it as answer... If you use OP's borders one can even see how really short that is: `np.setdiff1d(np.arange(1, m+1), df[0])` — SpghttCd, Jul 20 '18 at 08:18

score 3 · Accepted Answer · answered Jul 20 '18 at 08:16

Just build a Series covering the full range (assuming your bounds are represented in df), then use isin() to keep only the values that do not appear in df.

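import numpy as np

m = 10  # upper bound of the expected range
full = pd.Series(np.arange(1, m+1))

# keep only the values of the full range that are absent from df
full[~full.isin(df[0])].values
# array([6, 9])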
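
If the bounds are not known in advance, the same isin() idea can be sketched with the bounds taken from the data itself, as the question's loop does. This is only a sketch: the helper name `missing_values` is hypothetical, and it assumes an integer column.

```python
import numpy as np
import pandas as pd

def missing_values(series):
    """Return the integers between series.min() and series.max() absent from series."""
    # Hypothetical helper: bounds come from the data rather than a fixed m.
    full = pd.Series(np.arange(series.min(), series.max() + 1))
    # Boolean mask of range values that do NOT occur in the column.
    return full[~full.isin(series)].to_numpy()

df = pd.DataFrame([1, 2, 3, 4, 5, 7, 8, 10])
result = missing_values(df[0])
print(result)  # [6 9]
```

Note that with data-derived bounds, values missing below the minimum or above the maximum cannot be detected.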

andrew_reece


Yes! This reduced the running time from 162 seconds to 0.05 seconds! – Kristada673 Jul 20 '18 at 08:29
Great! Glad to be of help. – andrew_reece Jul 20 '18 at 08:31
Did you try and measure the `np.setdiff1d` solution of user2285236, too? It would be interesting to see how it compares. – SpghttCd Jul 20 '18 at 11:15
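
For anyone who wants to try the `np.setdiff1d` variant mentioned in the comments, a minimal sketch follows. It reuses the question's sample data and the bound `m = 10` from the accepted answer; no timings are claimed here.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame([1, 2, 3, 4, 5, 7, 8, 10])
m = 10  # upper bound, as in the accepted answer

# Values of the first array that are not in the second;
# the result is sorted and contains no duplicates.
missing = np.setdiff1d(np.arange(1, m + 1), df[0])
print(missing)  # [6 9]
```

To compare it with the isin() approach on real data, each expression could be timed with the standard-library `timeit` module.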

score 0 · Answer 2 · edited Jul 20 '18 at 08:20

df['didf'] = df[0] - df[0].shift(1) will highlight the gaps, assuming the column is sorted: a value greater than 1 means that one or more numbers are missing just before that row.
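
The difference column only marks where a gap ends. One possible way to turn it into the actual missing numbers is sketched below; the variable names are illustrative, and it assumes integer values.

```python
import pandas as pd

df = pd.DataFrame([1, 2, 3, 4, 5, 7, 8, 10])

# The differences are only meaningful on sorted values.
s = df[0].sort_values().reset_index(drop=True)
gaps = s.diff()  # same as s - s.shift(1)

# Each row with a step greater than 1 ends a gap;
# expand it into the integers that were skipped.
ends = s[gaps > 1].astype(int)
starts = (ends - gaps[gaps > 1]).astype(int)
missing = [v for lo, hi in zip(starts, ends) for v in range(lo + 1, hi)]
print(missing)  # [6, 9]
```

The final expansion is a Python-level loop, but it only runs over the gaps rather than over every row.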

edited Jul 20 '18 at 08:20

Mr. T


answered Jul 20 '18 at 08:18

krayyem

