
I have the following pandas DataFrame:

import pandas as pd

df = pd.DataFrame([
    ['A', 2017, 1],
    ['A', 2019, 1],
    ['B', 2017, 1],
    ['B', 2018, 1],
    ['C', 2016, 1],
    ['C', 2019, 1],
], columns=['ID', 'year', 'number'])

I am looking for the most efficient way to fill in the missing years for each ID with a default value of 0 in the `number` column.

The expected output is:

  ID  year  number
0  A  2017       1
1  A  2018       0
2  A  2019       1
3  B  2017       1
4  B  2018       1
5  C  2016       1
6  C  2017       0
7  C  2018       0
8  C  2019       1

The dataframe that I have is relatively big, so I am looking for an efficient solution.

Edit:

This is the code that I have so far:

# Per-ID min and max year
min_max_dict = df[['ID', 'year']].groupby('ID').agg(['min', 'max']).to_dict('index')

# Build the full (ID, year) pairs covering each ID's own range
new_ix = [[], []]
for id_ in df['ID'].unique():
    for year in range(min_max_dict[id_][('year', 'min')], min_max_dict[id_][('year', 'max')] + 1):
        new_ix[0].append(id_)
        new_ix[1].append(year)

# Reindex against the full index, filling the missing rows with 0
df.set_index(['ID', 'year'], inplace=True)
df = df.reindex(new_ix, fill_value=0).reset_index()

Result

  ID  year  number
0  A  2017       1
1  A  2018       0
2  A  2019       1
3  B  2017       1
4  B  2018       1
5  C  2016       1
6  C  2017       0
7  C  2018       0
8  C  2019       1
borisdonchev

8 Answers


A slightly faster approach than using explode is to use the pd.Series constructor. You can use .iloc if the years are already sorted from earliest to latest.

idx = df.groupby('ID')['year'].apply(lambda x: pd.Series(np.arange(x.iloc[0], x.iloc[-1]+1))).reset_index()
df.set_index(['ID','year']).reindex(pd.MultiIndex.from_arrays([idx['ID'], idx['year']]), fill_value=0).reset_index()

Output:

  ID  year  number
0  A  2017       1
1  A  2018       0
2  A  2019       1
3  B  2017       1
4  B  2018       1
5  C  2016       1
6  C  2017       0
7  C  2018       0
8  C  2019       1
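If the years within each ID might not already be sorted, a variant of the same idea (a sketch, not part of the original answer) swaps the positional .iloc lookups for min/max:

import numpy as np
import pandas as pd

# Same reindex approach, but robust to unsorted years within each ID
idx = (df.groupby('ID')['year']
         .apply(lambda x: pd.Series(np.arange(x.min(), x.max() + 1)))
         .reset_index())
out = (df.set_index(['ID', 'year'])
         .reindex(pd.MultiIndex.from_arrays([idx['ID'], idx['year']]), fill_value=0)
         .reset_index())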
Scott Boston
    @anky In this example with explode `17.2 ms ± 332 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)` with the pd.Series constructor `13.1 ms ± 208 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)` – Scott Boston Jul 09 '20 at 16:59

Here is another approach with reindex:

u = df.groupby('ID')['year'].apply(lambda x: range(x.min(),x.max()+1)).explode()

out = (df.set_index(['ID','year']).reindex(u.reset_index().to_numpy(),fill_value=0)
         .reset_index())

  ID  year  number
0  A  2017       1
1  A  2018       0
2  A  2019       1
3  B  2017       1
4  B  2018       1
5  C  2016       1
6  C  2017       0
7  C  2018       0
8  C  2019       1
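For reference, a quick sketch (not part of the original answer) of what the intermediate u looks like: after the explode it is a Series indexed by ID with one row per (ID, year), so u.reset_index().to_numpy() produces exactly the [ID, year] pairs that reindex consumes.

# Rebuild the intermediate to inspect it (assumes df as defined in the question)
u = df.groupby('ID')['year'].apply(lambda x: range(x.min(), x.max()+1)).explode()
print(u.reset_index().to_numpy())
# roughly: [['A' 2017] ['A' 2018] ['A' 2019] ['B' 2017] ['B' 2018]
#           ['C' 2016] ['C' 2017] ['C' 2018] ['C' 2019]]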
anky
# Per-ID min and max year
t = df.groupby('ID')['year'].agg(['min', 'max']).reset_index()
# Years missing between each ID's min and max
t['missing'] = t.transform(lambda x: [y for y in range(x['min'], x['max']+1) if y not in x.values], axis=1)
t = t[['ID', 'missing']].explode('missing').dropna()
t['number'] = 0
t.columns = ['ID', 'year', 'number']
pd.concat([df, t]).sort_values(by=['ID', 'year'])

Output

    ID  year    number
0   A   2017    1
0   A   2018    0
1   A   2019    1
2   B   2017    1
3   B   2018    1
4   C   2016    1
2   C   2017    0
2   C   2018    0
5   C   2019    1
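If a clean 0..n index like the expected output is needed, a small addition (not part of the original answer) is to reset the index after sorting:

# Same concat as above, plus a fresh RangeIndex so the rows are numbered 0..n
result = (pd.concat([df, t])
            .sort_values(by=['ID', 'year'])
            .reset_index(drop=True))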
Chris

Here is a method that avoids any slow applies with a lambda. It's memory-inefficient in the sense that we create a base DataFrame that is the cross product of all IDs and the full year range in your DataFrame. After an update we can efficiently slice it down to the periods you need with a Boolean mask. The mask is created from a cummax check in the forward and reverse directions.

If most IDs span the same general range of years, there won't be too much waste in creating the base DataFrame from the product. If you want even more performance, there are many posts about more efficient ways to do a cross product.

def Alollz(df):
    # Cross product of all IDs with the full year range
    idx = pd.MultiIndex.from_product([np.unique(df['ID']),
                                      np.arange(df['year'].min(), df['year'].max()+1)],
                                     names=['ID', 'year'])

    # Base frame of zeros, overwritten with the observed values
    df_b = pd.DataFrame({'number': 0}, index=idx)
    df_b.update(df.set_index(['ID', 'year']))

    # Keep only the rows between each ID's first and last observed year
    m = (df_b.groupby(level=0)['number'].cummax().eq(1)
         & df_b[::-1].groupby(level=0)['number'].cummax().eq(1))

    return df_b.loc[m].reset_index()

Alollz(df)

  ID  year  number
0  A  2017     1.0
1  A  2018     0.0
2  A  2019     1.0
3  B  2017     1.0
4  B  2018     1.0
5  C  2016     1.0
6  C  2017     0.0
7  C  2018     0.0
8  C  2019     1.0
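To make the masking step concrete, here is a minimal sketch with made-up values for a single ID (not data from the question), showing how the forward and reverse cummax passes keep only the years between the first and last observed 1:

import pandas as pd

# One ID over a 2015-2019 grid after the update step: the leading and trailing
# zeros should be dropped, the interior zero (2017) kept.
s = pd.Series([0, 1, 0, 1, 0], index=[2015, 2016, 2017, 2018, 2019], name='number')

fwd = s.cummax().eq(1)              # True from the first observed year onward
bwd = s[::-1].cummax()[::-1].eq(1)  # True up to and including the last observed year
print(s[fwd & bwd])
# 2016    1
# 2017    0
# 2018    1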

That's certainly a lot more code than some of the other proposals. But to see where it really shines, let's create some dummy data with 50K IDs (here I'll make the date range identical for every ID, just to keep the test data simple).

N = 50000
df = pd.DataFrame({'ID': np.repeat(range(N), 2),
                   'year': np.tile([2010,2018], N),
                   'number': 1})

#@Scott Boston's Answer
def SB(df):
    idx = df.groupby('ID')['year'].apply(lambda x: pd.Series(np.arange(x.iloc[0], x.iloc[-1]+1))).reset_index()
    df = df.set_index(['ID','year']).reindex(pd.MultiIndex.from_arrays([idx['ID'], idx['year']]), fill_value=0).reset_index()
    return df

# Make sure they give the same output:
(Alollz(df) == SB(df)).all().all()
#True

%timeit Alollz(df)
#1.9 s ± 73.9 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

%timeit SB(df)
#10.8 s ± 539 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

So that's about 5x faster, which is a pretty big deal when things are taking on the order of seconds.

ALollz

Here is an approach:

letter_keys = df.ID.unique()
data = df.values
missing_records = []
for letter in letter_keys:
    print(letter)
    # All years observed for this ID
    years = [x[1] for x in data if x[0] == letter]
    min_year = min(years)
    max_year = max(years)
    current_year = min_year
    # Collect the years missing between min_year and max_year
    while current_year < max_year:
        if current_year not in years:
            missing_records.append([letter, current_year, 0])
            print('missing', current_year)
        current_year += 1

new_df = df.append(pd.DataFrame(missing_records, columns=df.columns)).sort_values(['ID', 'year'])

Output

| ID   |   year |   number |
|:-----|-------:|---------:|
| A    |   2017 |        1 |
| A    |   2018 |        0 |
| A    |   2019 |        1 |
| B    |   2017 |        1 |
| B    |   2018 |        1 |
| C    |   2016 |        1 |
| C    |   2017 |        0 |
| C    |   2018 |        0 |
| C    |   2019 |        1 |
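One caveat: DataFrame.append used in the last line has since been deprecated and removed in newer pandas (2.0+); an equivalent final step with pd.concat would be (a sketch, reusing missing_records from the loop above):

# pd.concat replacement for the deprecated DataFrame.append
new_df = (pd.concat([df, pd.DataFrame(missing_records, columns=df.columns)])
            .sort_values(['ID', 'year']))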
Sebastien D

You can try using date_range and pd.merge:

g = df.groupby("ID")["year"].agg(["min", "max"]).reset_index()
id_years = pd.DataFrame(list(g.apply(lambda row: list(row["ID"]) + 
                    list(pd.date_range(start=f"01/01/{row['min']}", \
                    end=f"01/01/{row['max']+1}",freq='12M').year), axis=1))).melt(0).dropna()[[0,"value"]]

id_years.loc[:,"value"] = id_years["value"].astype(int)
id_years = id_years.rename(columns = {0:"ID","value":'year'})
id_years = id_years.sort_values(["ID","year"]).reset_index(drop=True)

## Merge two dataframe
output_df = pd.merge(id_years, df, on=["ID","year"], how="left").fillna(0)
output_df.loc[:,"number"] = output_df["number"].astype(int)
output_df

Output:

    ID  year    number
0   A   2017    1
1   A   2018    0
2   A   2019    1
3   B   2017    1
4   B   2018    1
5   C   2016    1
6   C   2017    0
7   C   2018    0
8   C   2019    1
Narendra Prasath

This would work, but it fills the full year range for every ID, e.g. it will create a '2019' entry for 'B' (and a '2016' entry for 'A'):

df.pivot(index='ID', columns='year', values='number').fillna(0).stack().to_frame('number')

returns:

         number
ID year        
A  2016     0.0
   2017     1.0
   2018     0.0
   2019     1.0
B  2016     0.0
   2017     1.0
   2018     1.0
   2019     0.0
C  2016     1.0
   2017     0.0
   2018     0.0
   2019     1.0

dokteurwho
    Unfortunately I am not looking for such a solution, because it assumes that every ID spans the same years. It not only includes a 2019 for B, but also a 2016 for A – borisdonchev Jul 09 '20 at 15:42

We could use the complete function from pyjanitor, which provides a convenient abstraction to generate the missing rows, in this case per ID group:

# pip install pyjanitor
import pandas as pd
import janitor as jn

# create mapping for range of years
years = dict(year = lambda year: range(year.min(), year.max() + 1))

# apply the complete function per group and fill the nulls with 0

df.complete(years, by = 'ID', sort = True).fillna(0, downcast = 'infer')
 
  ID  year  number
0  A  2017       1
1  A  2018       0
2  A  2019       1
3  B  2017       1
4  B  2018       1
5  C  2016       1
6  C  2017       0
7  C  2018       0
8  C  2019       1

However, the by argument is primarily for convenience; there are scenarios where it may be more efficient to do a bit more work, similar to @Alollz's solution:


# get the mapping for the year for the entire dataframe
years = dict(year =  range(df.year.min(), df.year.max() + 1))

# create a groupby
group = df.groupby('ID').year

#  assign the max and min years to the dataframe
(df.assign(year_max = group.transform('max'), 
           year_min = group.transform('min'))
     # run complete on the entire dataframe, without `by`
    # note that ID, year_min, year_max are grouped together
    # think of it as a DataFrame of just these three columns
    # combined with years .. we are not modifying these three columns
    # only the years 
   .complete(years, ('ID', 'year_min', 'year_max'))
    # filter rows where year is between max and min
   .loc[lambda df: df.year.between(df.year_min, df.year_max), 
        df.columns]
    # sort the values and fillna
   .sort_values([*df], ignore_index = True)
   .fillna(0, downcast = 'infer')
)
 
  ID  year  number
0  A  2017       1
1  A  2018       0
2  A  2019       1
3  B  2017       1
4  B  2018       1
5  C  2016       1
6  C  2017       0
7  C  2018       0
8  C  2019       1

Using @Alollz's sample data:

N = 50000
df = pd.DataFrame({'ID': np.repeat(range(N), 2),
                   'year': np.tile([2010,2018], N),
                   'number': 1})
def complete_sam(df):
    years = dict(year =  range(df.year.min(), df.year.max() + 1))
    group = df.groupby('ID').year
    outcome = (df.assign(year_max = group.transform('max'),
                         year_min = group.transform('min'))
                 .complete(years, ('ID', 'year_min', 'year_max'))
                 .loc[lambda df: df.year.between(df.year_min, 
                                                 df.year_max),
                     df.columns]
                 .sort_values([*df], ignore_index = True)
                 .fillna(0)
              )
    return outcome
#@Scott Boston's Answer
def SB(df):
    idx = df.groupby('ID')['year'].apply(lambda x: pd.Series(np.arange(x.iloc[0], x.iloc[-1]+1))).reset_index()
    df = df.set_index(['ID','year']).reindex(pd.MultiIndex.from_arrays([idx['ID'], idx['year']]), fill_value=0).reset_index()
    return df
#@Alollz's answer
def Alollz(df):
    idx = pd.MultiIndex.from_product([np.unique(df['ID']), 
                                      np.arange(df['year'].min(), df['year'].max()+1)],
                                     names=['ID', 'year'])
   
    df_b = pd.DataFrame({'number': 0}, index=idx)
    df_b.update(df.set_index(['ID', 'year']))
    
    m = (df_b.groupby(level=0)['number'].cummax().eq(1) 
         & df_b[::-1].groupby(level=0)['number'].cummax().eq(1))
    
    return df_b.loc[m].reset_index()
In [310]: Alollz(df).equals(complete_sam(df))
Out[310]: True
In [311]: %timeit complete_sam(df)
268 ms ± 24.1 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [312]: %timeit Alollz(df)
1.84 s ± 58.5 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [316]: SB(df).eq(complete_sam(df)).all().all()
Out[316]: True
In [317]: %timeit SB(df)
6.13 s ± 87.3 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
sammywemmy