Extracting specific rows from a data frame

Question

I have a data frame df1 with two columns 'ids' and 'names' -

ids     names
fhj56   abc
ty67s   pqr
yu34o   xyz

I have another data frame df2 which has some of the columns being -

user     values                       
1        ['fhj56','fg7uy8']
2        ['glao0','rt56yu','re23u']
3        ['fhj56','ty67s','hgjl09']

My result should give me those users from df2 whose values contains at least one of the ids from df1 and also tell which ids are responsible to put them into resultant table. Result should look like -

   user     values_responsible     names
   1        ['fhj56']              ['abc']
   3        ['fhj56','ty67s']      ['abc','pqr']

User 2 doesn't come in resultant table because none of its values exist in df1.

I was trying to do it as follows -

df2.query('values in @df1.ids')

But this doesn't seem to work well.

You need unnest your column `values` – BENY Aug 07 '17 at 18:57 — BENY, Aug 07 '17 at 18:57

Alexander · Accepted Answer · 2017-08-07T18:33:50.727

You can iterate through the rows and then use .loc together with isin to find the matching rows from df2. I converted this filtered dataframe into a dictionary

ids = []
names = []
users = []
for _, row in df2.iterrows():
    result = df1.loc[df1['ids'].isin(row['values'])]
    if not result.empty:
        ids.append(result['ids'].tolist())
        names.append(result['names'].tolist())
        users.append(row['user'])

>>> pd.DataFrame({'user': users, 'values_responsible': ids, 'names': names})[['user', 'values_responsible', 'names']]
   user values_responsible       names
0     1            [fhj56]       [abc]
1     3     [fhj56, ty67s]  [abc, pqr]

Or, for tidy data:

ids = []
names = []
users = []
for _, row in df2.iterrows():
    result = df1.loc[df1['ids'].isin(row['values'])]
    if not result.empty:
        ids.extend(result['ids'].tolist())
        names.extend(result['names'].tolist())
        users.extend([row['user']] * len(result['ids']))

>>> pd.DataFrame({'user': users, 'values_responsible': ids, 'names': names})[['user', 'values_responsible', 'names']])
   user values_responsible names
0     1              fhj56   abc
1     3              fhj56   abc
2     3              ty67s   pqr

Row-by-row operations in Pandas is a rather dirty way to go about the problem. [Here's](https://stackoverflow.com/a/7837947/4365003) a good explanation of why. In many cases, dataframe problems that can intuitively be solved with row-by-row iteration can also be solved by cleaver filtering or application of an `apply` or `map` function. — RagingRoosevelt, Aug 07 '17 at 18:33
Yes, row by row is not the most efficient, but that is because the data was not tidy to begin with. — Alexander, Aug 07 '17 at 18:34

score 2 · Answer 2 · answered Aug 07 '17 at 19:25

Try this , using the idea of unnest a list cell.

Temp_unnest = pd.DataFrame([[i, x]
              for i, y in df['values'].apply(list).iteritems()
                  for x in y], columns=list('IV'))

Temp_unnest['user']=Temp_unnest.I.map(df.user)
df1.index=df1.ids
Temp_unnest.assign(names=Temp_unnest.V.map(df1.names)).dropna().groupby('user')['V','names'].agg({(lambda x: list(x))})


Out[942]: 
                   V       names
            <lambda>    <lambda>
user                            
1            [fhj56]       [abc]
3     [fhj56, ty67s]  [abc, pqr]

RagingRoosevelt · Answer 3 · 2017-08-07T18:23:11.070

I would refactor your second dataframe (essentially, normalizing your database). Something like

user     gid     id                       
1        1       'fhj56'
1        1       'fg7uy8'
2        1       'glao0'
2        1       'rt56yu'
2        1       're23u'
3        1       'fhj56'
3        1       'ty67s'
3        1       'hgjl09'

Then, all you have to do is merge the first and second dataframe on the id column.

r = df2.merge(df1, left_on='id', right_on='ids', how='left')

You can exclude any gids for which some of the ids don't have a matching name.

r[~r[gid].isin(  r[r['names'] == None][gid].unique()  )]

where r[r['names'] == None][gid].unique() finds all the gids that have no name and then r[~r[gid].isin( ... )] grabs only entries that aren't in the list argument for isin.

If you had more id groups, the second table might look like

user     gid     id                       
1        1       'fhj56'
1        1       'fg7uy8'
1        2       '1asdf3'
1        2       '7ada2a'
1        2       'asd341'
2        1       'glao0'
2        1       'rt56yu'
2        1       're23u'
3        1       'fhj56'
3        1       'ty67s'
3        1       'hgjl09'

which would be equivalent to

user     values                       
1        ['fhj56','fg7uy8']
1        ['1asdf3', '7ada2a', 'asd341']
2        ['glao0','rt56yu','re23u']
3        ['fhj56','ty67s','hgjl09']

Extracting specific rows from a data frame

3 Answers3

Linked