1

Apologies in advance if my dataframe sample formatting is terrible, this is my first question and I'm also a novice with python.

I have a dataset that includes a list of items and tags associated with a particular item, there are other items in the set that are derived from the main item. These derived items do not currently have tags in their rows, but they should have the same tags as their parent item. There is an ID that I can use to link them, but they are not located in the same column or row (see example). These items are also within a bucket and the dataset may have multiple buckets, all with similar situations. I have successfully achieved the end result in an excel mockup using multiple worksheets and index match + conditions, but I can't figure out how to do this match across different rows using pandas.

Sample data set, there are many more columns than this, but I think this should work as an example.

Bucket Main Item ID Item Name Item Tag 1 Item Tag 2 Item Tag 3 Derived Item ID
26 123 Item A 50 1000 250 NaN
26 765 Item A (Derived) NaN NaN NaN 123

So I want to get the tags from Item 123 (50, 1000, 250) to replicate on any other item that has a matching item ID in the Derived Item ID column (there can be multiple).

Bucket Main Item ID Item Name Item Tag 1 Item Tag 2 Item Tag 3 Derived Item ID
26 123 Item A 50 1000 250 NaN
26 765 Item A (Derived) 50 1000 250 123

I originally tried to use a for loop to run through the rows and add any tags and Main Item IDs to their own dictionaries, then create a reference dataframe to match from, but I ran into issues with that. I've also tried to figure out if I could use isin() and np.where() to try and do all of the filling in one go, but again no luck.

I've been searching for related topics/questions for a few days and couldn't find anything similar. People are always trying to match across dataframes which ends up being a simple merge or something with df.loc, but I'd like to do this within the frame itself.

If this is best done by creating a separate dataframe I'm open to that, and any help or tips are greatly appreciated.

Apetri52
  • 13
  • 3
  • kindly let me know if you feel that your question was wrongly closed – Anurag Dabas Mar 24 '23 at 00:46
  • @Anurag Dabas The question you linked assumes fillna() style update of NaN values. This question asks about filling based on a join on two ID columns. These are different enough that it's not clear this question can be solved using answers to the prior one. – constantstranger Mar 24 '23 at 01:45

1 Answers1

0

What you've described can be done with a merge, but in this case the merge will join one column of the initial dataframe against another column of the same dataframe.

Here's a way to do it:

it = [col for col in df.columns if col.startswith('Item Tag')]
df.loc[df['Derived Item ID'].notna(), it] = (
    df[['Main Item ID','Derived Item ID']]
    .merge(df, how='left', left_on='Derived Item ID', right_on='Main Item ID')[it] )

Output:

   Bucket  Main Item ID         Item Name  Item Tag 1  Item Tag 2  Item Tag 3  Derived Item ID
0      26           123            Item A        10.0      1000.0       250.0              NaN
1      26           765  Item A (Derived)        10.0      1000.0       250.0            123.0

Explanation:

  • start with just the key columns Main Item ID and Derived Item ID
  • merge the Derived Item IDcolumn of that with the Main Item ID column of the original dataframe
  • for rows with non-null Derived Item ID, update the item tag columns in the original dataframe (which start out as NaN) with the ones from the merge output.
constantstranger
  • 9,176
  • 2
  • 5
  • 19
  • Thank you! This worked, I just have to go one level lower, but I think I can apply this same logic there. I think I understand what's happening here except for that last part after the merge. I get that you're making a list of columns where the condition is met, just not sure how it is being applied. Is there any documentation related to that type of use? – Apetri52 Mar 24 '23 at 19:33
  • Just to avoid any confusion, can you please tell me the specific code from my answer that you would like documentation about? – constantstranger Mar 24 '23 at 21:56
  • What purpose does the list 'it' serve at the end of the merge. merge(df, how='left', left_on='Derived Item ID', right_on='Main Item ID')[it] ) – Apetri52 Mar 25 '23 at 07:04
  • The variable `it` is simply shorthand for the list of column labels that are item tags. It's used on both the left and right sides of the assignment containing the merge, and it simply limits the columns that we copy from the merge result into the original dataframe to item tag columns. – constantstranger Mar 25 '23 at 14:33