
I am organising AWS resources for tagging and have captured the data into a CSV file; a sample of the file is below. For each resource_id, I need to ensure that a particular set of tag_key values is present. This set is

tag_key

Application
Client
Environment
Name
Owner
Project
Purpose

I'm new to pandas; so far I've only managed to read the CSV file into a DataFrame:

import pandas as pd

file_name = "z.csv"

df = pd.read_csv(file_name, names=['resource_id', 'resource_type', 'tag_key', 'tag_value'])

print(df)

CSV file

vol-00441b671ca48ba41,volume,Environment,Development
vol-00441b671ca48ba41,volume,Name,Database Files
vol-00441b671ca48ba41,volume,Project,Application Development
vol-00441b671ca48ba41,volume,Purpose,Web Server
i-1234567890abcdef0,instance,Environment,Production
i-1234567890abcdef0,instance,Owner,Fast Company

I am expecting the output to be as follows

vol-00441b671ca48ba41,volume,Environment,Development
vol-00441b671ca48ba41,volume,Name,Database Files
vol-00441b671ca48ba41,volume,Project,Application Development
vol-00441b671ca48ba41,volume,Purpose,Web Server
vol-00441b671ca48ba41,volume,Client,
vol-00441b671ca48ba41,volume,Owner,
vol-00441b671ca48ba41,volume,Application,
i-1234567890abcdef0,instance,Environment,Production
i-1234567890abcdef0,instance,Owner,Fast Company
i-1234567890abcdef0,instance,Application,
i-1234567890abcdef0,instance,Client,
i-1234567890abcdef0,instance,Name,
i-1234567890abcdef0,instance,Project,
i-1234567890abcdef0,instance,Purpose,
meappy
  • Possible duplicate of [How to iterate over rows in a DataFrame in Pandas?](https://stackoverflow.com/questions/16476924/how-to-iterate-over-rows-in-a-dataframe-in-pandas) – M_S_N Oct 08 '19 at 14:57
  • @M_S_N I've seen the post, but it's not the same, I've modified my post – meappy Oct 08 '19 at 15:31

2 Answers


To make a slightly simpler example, consider this DataFrame df:

df = pd.DataFrame(data={'a': [1, 1, 2, 2], 'b': [[1, 2], [3, 5], [1, 2], [5]]})

Returning

   a       b
0  1  [1, 2]
1  1  [3, 5]
2  2  [1, 2]
3  2     [5]

The required values of b are 1, 2, 3, 4 and 5.

First we need to find out what we already have, by flattening the lists per group:

def flatten(lsts):
    return [j for i in lsts for j in i]

df_new = df.groupby(by=['a'])['b'].apply(flatten)

Returns:

a
1    [1, 2, 3, 5]
2       [1, 2, 5]

Now we need to list the columns we are missing and add those:

df_new = df_new.reset_index()
lst_wanted = [1, 2, 3, 4, 5]

for row in df_new.itertuples():
    for j in lst_wanted:
        if j not in row.b:
            # DataFrame.append was removed in pandas 2.0; use pd.concat instead
            df = pd.concat([df, pd.DataFrame([{'a': row.a, 'b': j}])],
                           ignore_index=True)
print(df)

Returning:

   a       b
0  1  [1, 2]
1  1  [3, 5]
2  2  [1, 2]
3  2     [5]
4  1       4
5  2       3
6  2       4
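For larger frames, the "add what's missing" step can also be sketched without the explicit row loop, using explode and a set difference per group (a sketch on the same toy data):

```python
import pandas as pd

df = pd.DataFrame(data={'a': [1, 1, 2, 2], 'b': [[1, 2], [3, 5], [1, 2], [5]]})
wanted = {1, 2, 3, 4, 5}

# One row per (a, element-of-b) pair, then collect each group's values as a set
have = df.explode('b').groupby('a')['b'].apply(set)

# For each group, the wanted values that are not present yet
missing = have.apply(lambda s: sorted(wanted - s))

# Turn the per-group lists of missing values into rows and append them
extra = missing.explode().dropna().rename('b').reset_index()
df = pd.concat([df, extra], ignore_index=True)
print(df)
```

This produces the same extra rows (a=1 gets 4; a=2 gets 3 and 4) without iterating in Python.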
Nathan

One way to do this is to use a MultiIndex, from_product, and reindex:

taglist = ['Application',
           'Client',
           'Environment',
           'Name',
           'Owner',
           'Project',
           'Purpose']

df_out = df.set_index(['resource_id', 'tag_key'])\
           .reindex(pd.MultiIndex.from_product([df['resource_id'].unique(), taglist],
                                               names=['resource_id', 'tag_key']))

df_out = df_out.assign(resource_type=df_out.groupby('resource_id')['resource_type']
                                           .ffill().bfill()).reset_index()
print(df_out)

Output:

              resource_id      tag_key resource_type                tag_value
0   vol-00441b671ca48ba41  Application        volume                      NaN
1   vol-00441b671ca48ba41       Client        volume                      NaN
2   vol-00441b671ca48ba41  Environment        volume              Development
3   vol-00441b671ca48ba41         Name        volume           Database Files
4   vol-00441b671ca48ba41        Owner        volume                      NaN
5   vol-00441b671ca48ba41      Project        volume  Application Development
6   vol-00441b671ca48ba41      Purpose        volume               Web Server
7     i-1234567890abcdef0  Application      instance                      NaN
8     i-1234567890abcdef0       Client      instance                      NaN
9     i-1234567890abcdef0  Environment      instance               Production
10    i-1234567890abcdef0         Name      instance                      NaN
11    i-1234567890abcdef0        Owner      instance             Fast Company
12    i-1234567890abcdef0      Project      instance                      NaN
13    i-1234567890abcdef0      Purpose      instance                      NaN
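If you want the comma-separated output from the question (an empty tag_value rather than NaN), the reindexed frame can be filled and written back out. A self-contained sketch, using an inline CSV sample in place of the original file (the file name and sample rows here are illustrative only):

```python
import io
import pandas as pd

csv_data = """vol-00441b671ca48ba41,volume,Environment,Development
vol-00441b671ca48ba41,volume,Name,Database Files
i-1234567890abcdef0,instance,Owner,Fast Company
"""
df = pd.read_csv(io.StringIO(csv_data),
                 names=['resource_id', 'resource_type', 'tag_key', 'tag_value'])

taglist = ['Application', 'Client', 'Environment', 'Name',
           'Owner', 'Project', 'Purpose']

# Build the full (resource_id, tag_key) grid and reindex onto it
idx = pd.MultiIndex.from_product([df['resource_id'].unique(), taglist],
                                 names=['resource_id', 'tag_key'])
df_out = df.set_index(['resource_id', 'tag_key']).reindex(idx)

# Fill resource_type within each resource, then restore the column order
df_out['resource_type'] = df_out.groupby('resource_id')['resource_type'].ffill().bfill()
df_out = df_out.reset_index()[['resource_id', 'resource_type', 'tag_key', 'tag_value']]

# Empty strings instead of NaN so the CSV rows end with a bare comma
df_out['tag_value'] = df_out['tag_value'].fillna('')
print(df_out.to_csv(index=False, header=False))
```

Pass a real path to df_out.to_csv(...) instead of printing to save the result.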
Scott Boston
  • @m3appy Would you consider [accepting](https://meta.stackexchange.com/questions/5234/how-does-accepting-an-answer-work?answertab=votes#tab-top) this solution? – Scott Boston Oct 08 '19 at 15:17
  • just found that rows which do not have data that match the 'taglist' will get abandoned, is there a way to "add and not remove"? – meappy Oct 08 '19 at 15:22
  • tried that got this `ValueError: operands could not be broadcast together with shapes (7,) (6,) ` – meappy Oct 08 '19 at 15:53
  • Can you create a small dataset that generates this error? Create a new question, indicate the error and the code generating it. I need to be able to duplicate the error in order to make a suggestion. – Scott Boston Oct 08 '19 at 16:43
  • Hi ScottBoston I had to do something like this `taglist = np.unique(np.concatenate((taglist,taglist_present),0))`. I'll start a new question as I think this has deviated – meappy Oct 09 '19 at 02:02