0

I have a data frame, df, with columns "TotalVolume," "Small," "Large," "AllSizes," and "TotalBags" that are correlated. Here, the TotalVolume is the sum of the other four columns.

Data frame

How do I loop through the data frame columns to see which entries are empty, then fill in the missing value with the value derived from the other columns? For example, if a value from the "AllSizes" column is missing, then the code would loop through the first, the second, the third, then the fourth column, seeing that it is missing a value and uses the fourth command to fill in the missing value.

Expected output:

Expected output

First, second, third, fourth, fifth commands respectively:

df['TotalVolume'] = df['Small'] + df['Large'] + df['AllSizes'] + df['TotalBags']
df['Small'] = df['TotalVolume'] - df['Large'] - df['AllSizes'] - df['TotalBags']
df['Large'] = df['TotalVolume'] - df['Small'] - df['AllSizes'] - df['TotalBags']
df['AllSizes'] = df['TotalVolume'] - df['Small'] - df['Large'] - df['TotalBags']
df['TotalBags'] = df['TotalVolume'] - df['Small'] - df['Large'] - df['AllSizes']
c200402
  • 153
  • 1
  • 10
  • What's the fourth command? Can you show what would be replaced with the nan values in the example you have provided? – TheFaultInOurStars Mar 23 '22 at 20:52
  • Use `.fillna()` method to fill empty values. – Barmar Mar 23 '22 at 20:54
  • @Barmar How do I use .fillna() here using a formula to derive the missing values? – c200402 Mar 23 '22 at 20:56
  • @AmirhosseinKiani The fourth command is `df['AllSizes'] = df['TotalVolume'] - df['Small'] - df['Large'] - df['TotalBags']`. I edited to show expected output. – c200402 Mar 23 '22 at 21:00
  • 2
    Did you read the documentation? The first argument is a series to use as the fill values. – Barmar Mar 23 '22 at 21:07
  • Keep in mind that if a row contains multiple NaNs, you will just end up filling NaN with NaN unless you address the issue. – tdy Mar 24 '22 at 04:40

1 Answers1

0

First of all, looping through a dataframe (using itertuples or iterrows for example) should always be used as a last resort, due to its inefficiency and slowness (More on this here: https://stackoverflow.com/a/24871316/18480015)

Assuming you have only 5 columns, you could replicate the code you wrote using the lambda function:

df['AllSizes'] = df.apply(
lambda row: row['TotalVolume']-row['Small']-row['Large']-row['TotalBags']
if np.isnan(row['AllSizes']) else row['AllSizes'], axis=1)

And so on for the other columns:

df['TotalBags'] = df.apply(
lambda row: row['TotalVolume']-row['Small']-row['Large']-row['AllSizes']
if np.isnan(row['AllSizes']) else row['AllSizes'], axis=1)

etc...

ofka27
  • 11
  • 2
  • `apply(axis=1)` is just a loop in disguise, so it's also very inefficient. Use [`fillna()`](https://pandas.pydata.org/docs/reference/api/pandas.Series.fillna.html) like Barmar suggested. – tdy Mar 23 '22 at 21:13
  • When I tried using this, I still got null values in the five columns. Is it supposed to be like this: `df['Small'] = df.apply( lambda row: row['TotalVolume'] - row['Large'] - row['AllSizes'] - row['TotalBags'] if np.isnan(row['AllSizes']) else row['AllSizes'], axis=1)`, etc. ? – c200402 Mar 24 '22 at 03:51
  • @tdy For TotalBags, I tried using fillna() like this, but it doesn't work. Am I setting it up correctly? `df['TotalBags'].fillna(df['TotalVolume']-df['Small']-df['Large']-df['AllSizes'] )` – c200402 Mar 24 '22 at 03:56
  • @c200402 Make sure to assign back: `df['TotalBags'] = df['TotalBags'].fillna(df['TotalVolume']-df['Small']-df['Large']-df['AllSizes'])` – tdy Mar 24 '22 at 03:59
  • @tdy When I replaced axis=1 with `df['TotalBags'].fillna(df['TotalVolume']-df['Small']-df['Large']-df['AllSizes'])`, I got an error message: "SyntaxError: expression cannot contain assignment, perhaps you meant "=="?" When I changed to `==`, I got another error message, "TypeError: unhashable type: 'Series'" – c200402 Mar 24 '22 at 04:34
  • @c200402 Not sure what you mean about replacing `axis=1`? The `fillna` approach is unrelated to `apply` and `axis=1`. – tdy Mar 24 '22 at 04:36
  • @tdy For "TotalBags," I used `if np.isnan(row['AllSizes']) else row['AllSizes'], df['TotalBags'] == df['TotalBags'].fillna(df['TotalVolume']-df['Small']-df['Large']-df['AllSizes']))` instead of `if np.isnan(row['AllSizes']) else row['AllSizes'], axis=1)`. Did you mean this? – c200402 Mar 24 '22 at 04:42
  • @c200402 No, `fillna` is just its own command: `df['TotalBags'] = df['TotalBags'].fillna(df['TotalVolume']-df['Small']-df['Large']-df['AllSizes'])` – tdy Mar 24 '22 at 04:46
  • @tdy I replaced the `=` with `==` as I got an error message about that, and now get another error message "if np.isnan(row['AllSizes']) else row['AllSizes'], axis=1, df['TotalBags'] == df['TotalBags'].fillna(df['Small'] + df['Large'] + df['AllSizes'] + df['TotalBags'])) SyntaxError: positional argument follows keyword argument." Do you know how to fix this? I'm unsure what this means. – c200402 Mar 24 '22 at 04:54
  • @c200402 I'm not sure how else to explain it. The `fillna` code **is just its own command,** but you're still putting it with some other if/else/axis stuff. Just copy-paste this line of code without altering it and run the line individually: `df['TotalBags'] = df['TotalBags'].fillna(df['TotalVolume']-df['Small']-df['Large']-df['AllSizes'])` – tdy Mar 24 '22 at 04:57
  • @tdy I still got `TotalVolume 7, Small 5, Large 6, AllSizes 10, TotalBags 10` null values – c200402 Mar 24 '22 at 05:09
  • @c200402 See [my comment above](https://stackoverflow.com/questions/71593834/python-loop-to-fill-missing-column-values/71594043?noredirect=1#comment126539266_71593834) for one of the likely reasons. – tdy Mar 24 '22 at 05:11
  • @tdy How do I address the issue if a row contains multiple NaNs? – c200402 Mar 24 '22 at 05:14
  • There's no clear answer. You'll just have to figure out how you want to deal with it. You might consider just dropping those rows. – tdy Mar 24 '22 at 05:20