0

I'm sure this has been answered elsewhere, but I can't think of how to exactly phrase it.

I am constructing a dataframe from orders. The way the csv is presented to me is, if the customer ordered more than 1 item, the following line in the csv is almost filled with NAs except for the Order Number and the item sku.

Order Number, Created at, Quantity, Sku, Tags, Bunch of other columns
11111, 1/29/2020, 1, XL-1234, Nan, ...
11112, 1/29/2020, 1, XL-1234, Wholesale, ...
11112, Nan, 2, SM-1254, Nan, ...
11113, 1/29/2020, 1, XL-1234, Nan, ...
df = pd.DataFrame({'Order Number':[11111, 11112, 11112, 11113],
                   'Created at':['1/29/2020', '1/29/2020', np.nan, '1/29/2020'],
                   'Quantity':[1,1,2,1],
                   'Sku':['XL-1234', 'XL-1234', 'SM-1254', 'XL-1234'],
                   'Tags':[np.nan, 'Wholesale', np.nan, np.nan]
                  })
print(df)

   Order Number Created at  Quantity      Sku       Tags
0         11111  1/29/2020         1  XL-1234        NaN
1         11112  1/29/2020         1  XL-1234  Wholesale
2         11112        NaN         2  SM-1254        NaN
3         11113  1/29/2020         1  XL-1234        NaN

What I need to do is fill the Nans of the same Order Number with the value that has been supplied, such as with a df.fillna(method='ffill', inplace=True).

The issue I keep running into is that it will fill all the Nans of a column with the most recent Non-Nan value.

Expected Result:
Order Number, Created at, Quantity, Sku, Tags, Bunch of other columns
11111, 1/29/2020, 1, XL-1234, Nan, ...
11112, 1/29/2020, 1, XL-1234, Wholesale, ...
11112, 1/29/2020, 2, SM-1254, Wholesale, ...
11113, 1/29/2020, 1, XL-1234, Nan, ...
df.fillna(method='ffill', inplace=True)
print(df)
   Order Number Created at  Quantity      Sku       Tags
0         11111  1/29/2020         1  XL-1234        NaN
1         11112  1/29/2020         1  XL-1234  Wholesale
2         11112  1/29/2020         2  SM-1254  Wholesale
3         11113  1/29/2020         1  XL-1234  Wholesale

I've tried to do things like:

for unique in df['Order Number'].unique():
    df.loc[df['Order Number'] == unique].fillna(method='ffill', inplace=True)

I can't currently use a df.groupby (that I can think of) because some of the customers don't have a 'Wholesale' tag, but are special customers that I need to consider on their own.

My hope is to split the main dataframe into 4 dataframes. These would track all the sales (this isn't exactly its own dataframe), sales for Wholesale, sales for "special" customers, and sales for the rest of the customers.

If someone could point me in the right direction, or at least help me figure out how to search for this else where, I would be very grateful!

  • Welcome to SO! Please take a moment to read about how to post pandas questions: http://stackoverflow.com/questions/20109391/how-to-make-good-reproducible-pandas-examples – YOLO Jan 29 '20 at 21:54
  • Is there a way to apply this to all columns at once, instead of going through each column one at a time? My dataframe has something like 30+ columns. I mean, I guess I don't use all of the columns, but it would still be a pain to go through the 6-8 columns I do use. – Blaine Haws Jan 29 '20 at 22:08
  • Think I got it. That link did help, once I figure out how to apply it to all columns. for column in df.columns: df[column] = df.groupby('Name')[column].apply(lambda x: x.fillna(method='ffill')) – Blaine Haws Jan 29 '20 at 22:36

0 Answers0