Say there's a dataframe:

import pandas as pd
df = pd.DataFrame([1,2,3,4,5, 7,8, 10])

I want to find the "missing" numbers in it (6 and 9). My code to do this is:

li = []
low = int(min(df.values))
high = int(max(df.values))

for i in range(low, high+1):
    if i not in df.values:
        li.append(i)

print(li)
>>> [6, 9]

But if the dataframe is huge, the for loop may take some time. In my case, with a dataframe of ~300k rows, it's taking 162 seconds.

Is there a more efficient (vectorized?) way to do this?

Kristada673

2 Answers

Just make a list of the full range (assuming your bounds are represented in df), and then use isin() to find the difference.

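import numpy as np

m = 10  # upper bound of the expected range
full = pd.Series(np.arange(1, m+1))  # every value from 1 to m

# keep the values of the full range that are absent from df
full[~full.isin(df[0])].values
# array([6, 9])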
andrew_reece

df['didf'] = df[0] - df[0].shift(1) will highlight gaps: assuming the column is sorted, a value greater than 1 means at least one number is missing just before that row.
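
To go from the flagged gaps to the actual missing numbers, something like the following might work. It is only a sketch: it sorts and de-duplicates first (an assumption, since the diff only makes sense on ordered data), and the names s, gaps, starts and ends are illustrative:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame([1, 2, 3, 4, 5, 7, 8, 10])

# the diff is only meaningful on sorted, unique values
s = df[0].sort_values().drop_duplicates().reset_index(drop=True)
gaps = s.diff()  # equivalent to s - s.shift(1)

# rows where the jump from the previous value is larger than 1
ends = s[gaps > 1].to_numpy()
starts = s.shift(1)[gaps > 1].to_numpy().astype(int)

# expand each (previous value, current value) pair into the numbers in between
missing = [n for lo, hi in zip(starts, ends) for n in range(lo + 1, hi)]
print(missing)  # [6, 9]
```

The list comprehension only loops over the gaps, not over every value in the range, so it should stay cheap when few numbers are missing.
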

Mr. T
krayyem