2

I'm new to pandas so likely overlooking something but I've been searching and haven't found anything helpful yet.

What I'm trying to do is this. I have 2 dataframes. df1 has only 1 column and an unknown number of rows. df2 has an unknown number of rows also and also an unknown number of columns for each index.

Example:
df1:
0     1117454
1     1147637
2     1148945
3     1149662
4     1151543
5     1151545
6     1236268
7     1236671
8     1236673
...
300   1366962

df2:
                1        2        3        4        5        6        7  
8302813476  1375294  1375297  1375313  1375318  1375325  1375330  1375331   
8302813477  1317422  1363270  1363288  1363262     None     None     None   
8302813478  1187269  1187276  1149662  1147843  1147639  1236650  1236656

So what I want is to check all df1 values against df2 column 1 - n and if there is a match with any value in df1 mark the index of df2 as True else it is False.

piRSquared
  • 285,575
  • 57
  • 475
  • 624

2 Answers2

3

I think you can use isin for testing matching of Series created from df2 by stack with Series created from one column df1 by squeeze. Last reshape by unstack:

df3 = df2.stack().isin(df1.squeeze()).unstack()
print (df3)
                1      2      3      4      5      6      7
8302813476  False  False  False  False  False  False  False
8302813477  False  False  False  False  False  False  False
8302813478  False  False   True  False  False  False  False

Then get find all values where at least one True by any:

a = df3.any(axis=1)
print (a)
8302813476    False
8302813477    False
8302813478     True
dtype: bool

And last boolean indexing:

print (a[a].index)
Int64Index([8302813478], dtype='int64')

Another solution is instead squeeze use df1['col'].unique(), thank you Ted Petrou:

df3 = df2.stack().isin(df1['col'].unique()).unstack()
print (df3)
                1      2      3      4      5      6      7
8302813476  False  False  False  False  False  False  False
8302813477  False  False  False  False  False  False  False
8302813478  False  False   True  False  False  False  False

---

I like squeeze more, but same output is simple selecting column of df1:

df3 = df2.stack().isin(df1['col']).unstack()
print (df3)
                1      2      3      4      5      6      7
8302813476  False  False  False  False  False  False  False
8302813477  False  False  False  False  False  False  False
8302813478  False  False   True  False  False  False  False
Community
  • 1
  • 1
jezrael
  • 822,522
  • 95
  • 1,334
  • 1,252
  • 1
    Why is `squeeze` necessary? Can't you change `df1.squeeze()` to `df1['col'].unique()`? Or does `isin` take care of uniqueness under the hood? – Ted Petrou Dec 16 '16 at 07:07
  • Yes, it is another solution, I think faster. – jezrael Dec 16 '16 at 07:10
  • exactly. isin do uniqueness, but then solution is slowier. – jezrael Dec 16 '16 at 07:11
  • @TedPetrou I think squeeze is convenient in answering questions when it's unclear what the OP has. I've seen posters say they had a DataFrame when it was a Series and if it was a DataFrame I don't know what the column name is. I could do `df.iloc[:, 0]` but I think that's ugly sometimes. – piRSquared Dec 16 '16 at 07:20
  • Ok thanks. that makes sense. It just looked bizarre. Was wondering what I was missing. – Ted Petrou Dec 16 '16 at 07:22
3

As an interesting numpy alternative

l1 = df1.values.ravel()
l2 = df2.values.ravel()

pd.DataFrame(
    np.equal.outer(l1, l2).any(0).reshape(df2.values.shape),
    df2.index, df2.columns
)

or using set, list and comprehension

l1 = set(df1.values.ravel().tolist())
l2 = df2.values.ravel().tolist()

pd.DataFrame(
    np.array([bool(l1.intersection([d])) for d in l2]).reshape(df2.values.shape),
    df2.index, df2.columns
)

enter image description here

piRSquared
  • 285,575
  • 57
  • 475
  • 624