
I'm using Python 3.7.7 with Pandas 1.1.3.

I have two Pandas DataFrames df1 and df2 with identical column names, which can be listed with list(df1.columns) or list(df2.columns). There are 200 columns, i.e. len(list(df1.columns)) = 200.

I would like to find all rows in df2 where the first 10 values match the first 10 values in the i'th row of df1.

I am aware this can be done the following way:

colnames = list(df1.columns[:10]) # lists the first 10 column names
toFind = list(df1.iloc[i][:10]) # lists the first 10 values in the i'th row of df1
df2.loc[(df2[colnames[0]] == toFind[0]) & (df2[colnames[1]] == toFind[1]) & (df2[colnames[2]] == toFind[2]) & (df2[colnames[3]] == toFind[3]) & (df2[colnames[4]] == toFind[4]) & (df2[colnames[5]] == toFind[5]) & (df2[colnames[6]] == toFind[6]) & (df2[colnames[7]] == toFind[7]) & (df2[colnames[8]] == toFind[8]) & (df2[colnames[9]] == toFind[9])]

However, this is extremely long and difficult to change. If I want to filter by a different number of matching columns (other than 10), I'd have to rewrite the whole expression every time.

Is there a way to do this automatically? I'm looking for something like

colnames = list(df1.columns[:10])
toFind = list(df1.iloc[i][:10])
df2.loc[(df2[colnames[t]] == toFind[t]) for t in range(len(colnames))]

However this obviously returns an error, as nothing like this is implemented in Pandas.

Example: (for simplicity's sake, I'll use 3 instead of 10, and set i=0)

df1:

i  col1  col2  col3  col4  col5  col6  col7
0     1     7     3     4     8     2     4
1     2     5     7     1     4     8     2
2     6     6     8     8     9     1     3
3     7     8     5     2     3     0     9
4     4     0     7     4     5     6     3 
5     2     7     6     8     1     7     5

df2:

i  col1  col2  col3  col4  col5  col6  col7
0     1     7     3     3     3     4     5
1     2     5     7     5     5     2     5
2     1     7     3     6     6     4     2
3     3     5     7     7     7     6     4
4     1     7     3     8     8     2     3 
5     2     5     7     3     9     0     6

Then I would like to select all rows of df2 where the first 3 values are 1 7 3 (same as the first 3 values of the 0th row of df1). That would be

filtered df2 (0):

i  col1  col2  col3  col4  col5  col6  col7
0     1     7     3     3     3     4     5
2     1     7     3     6     6     4     2
4     1     7     3     8     8     2     3

As you can see, this keeps the rows of df2 where the first 3 values are 1 7 3.

If instead i=1, then I'd like to filter for the rows of df2 where the first 3 values are 2 5 7 (same as the first 3 values of the 1st row of df1).

In that case, the output would be

filtered df2 (1):

i  col1  col2  col3  col4  col5  col6  col7
1     2     5     7     5     5     2     5
5     2     5     7     3     9     0     6

The aforementioned 3, which controls which columns I filter by, is also a parameter: in the previous example it controlled col1 col2 col3, and in my first explanation it was 10, controlling col1 col2 ... col10.

In general, it could be any list of columns, for example [col5, col7, col8, col15].


1 Answer


You can use np.logical_and.reduce:

import numpy as np

i = 0
colnames = list(df1.columns[:3]) # first 3 column names
toFind = list(df1.iloc[i][:3])   # first 3 values of the i'th row of df1

# one boolean Series per column, ANDed together into a single mask
mask = np.logical_and.reduce([df2[colnames[t]] == toFind[t] for t in range(len(colnames))])
df = df2[mask]

print (df)
   col1  col2  col3  col4  col5  col6  col7
i                                          
0     1     7     3     3     3     4     5
2     1     7     3     6     6     4     2
4     1     7     3     8     8     2     3
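
The reduce-based mask is not limited to a leading slice of columns; any list of shared column names works. A minimal sketch with a made-up column subset from the example data:

import numpy as np

# any subset of shared columns works; this particular list is just an illustration
cols = ['col2', 'col5', 'col7']
i = 0

targets = df1.iloc[i][cols]  # values of the i'th row of df1 in those columns
mask = np.logical_and.reduce([df2[c] == v for c, v in zip(cols, targets)])
filtered = df2[mask]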

Another idea is to use DataFrame.eq with all:

i = 0
colnames = list(df1.columns[:3])
toFind = list(df1.iloc[i][:3])
# compare the selected columns against the target values and keep rows where all match
df = df2[df2[colnames].eq(toFind).all(axis=1)]

print (df)
   col1  col2  col3  col4  col5  col6  col7
i                                          
0     1     7     3     3     3     4     5
2     1     7     3     6     6     4     2
4     1     7     3     8     8     2     3
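
If this has to run for several values of i or different column subsets, the eq/all approach wraps naturally in a small helper. This is only a sketch; the function name filter_matching is made up for illustration:

def filter_matching(df1, df2, i, cols):
    # rows of df2 whose values in `cols` equal those of the i'th row of df1
    target = df1.iloc[i][cols]
    return df2[df2[cols].eq(target).all(axis=1)]

# first 3 columns, as in the example above
print (filter_matching(df1, df2, i=0, cols=list(df1.columns[:3])))

# or any explicit subset of shared columns
print (filter_matching(df1, df2, i=1, cols=['col5', 'col7']))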
  • The part with `df1[toFind]` gives me an error, because `len(toFind) = 10`, while the number of columns in `df1` is 200, and the remaining 190 columns aren't specified. `KeyError: "None of [Index(['col1','col2',...,'col10'], dtype='object')] are in the [columns]"` – Daniel P Mar 02 '22 at 13:37
  • Okay, well, the most recent edit does output something, however it doesn't contain row `i` of `df1`, how do I filter for that most easily? – Daniel P Mar 02 '22 at 13:42
  • @DanielP - Can you post some sample data with expected output? Is a loop necessary? – jezrael Mar 02 '22 at 13:43
  • Okay, I'll edit my main post with an example. Give me a minute. – Daniel P Mar 02 '22 at 13:43
    Added an example. Hopefully my question is clearer now. – Daniel P Mar 02 '22 at 13:57
    @DanielP - answer was edited. – jezrael Mar 02 '22 at 14:01
    On second thought, the second idea also works. Obviously you can change `toFind` there as well. My bad. – Daniel P Mar 02 '22 at 14:09