0

Say I have two dataframes A and B, each containing two columns called x and y. I want to join these two dataframes, not on rows where the x and y columns are equal across the two dataframes, but on rows where A's x column is a substring of B's x column, and the same for y. For example,

if A[x][1]='mpla' and B[x][1]='mplampla'

I would want that to be captured.
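
To make it concrete, the two frames might look something like this (made-up values):

import pandas as pd

A = pd.DataFrame({'x': ['foo', 'mpla'], 'y': ['zz', 'ab']})
B = pd.DataFrame({'x': ['bar', 'mplampla'], 'y': ['qq', 'abab']})

# A['x'][1] = 'mpla' is a substring of B['x'][1] = 'mplampla', and
# A['y'][1] = 'ab' is a substring of B['y'][1] = 'abab', so that pair
# of rows should appear in the join.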

In SQL it would be something like:

select *
from A
join B
on A.x <= B.x and A.y <= B.y

Can something like this be done in Python?

yortos
  • The SQL you gave does not implement what you described. Which one do you want: substring matching, or lexicographical comparison? – John Zwinck Jan 21 '15 at 05:57
  • Oh I know, it was just an example. On this specific occasion I need substring matching. But ideally I would like to learn if there is a way to implement it for other relationships too (for example, if I had numbers, etc.) – yortos Jan 21 '15 at 06:08
  • Here is a similar question... http://stackoverflow.com/questions/22723286/complex-joins-in-pandas – DataByDavid Jan 22 '15 at 03:27

2 Answers

0

You can match a single string at a time against all the strings in one column, like this:

import numpy.core.defchararray as ca

ca.find(B.x.values.astype(str), 'mpla') >= 0
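# find() gives the position of 'mpla' within each string of B.x (-1 if absent),
# so ">= 0" marks the rows of B.x that contain 'mpla'.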

The problem with that is you'll have to loop over all elements of A. But if you can afford that, it should work.
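
For instance, a rough sketch of that brute-force loop (the A_x/A_y columns in the result are just illustrative names I made up):

import numpy.core.defchararray as ca
import pandas as pd

def substring_join(A, B):
    # For each row of A, mark the rows of B whose x and y contain
    # A's x and y as substrings, then stack the matches.
    Bx = B.x.values.astype(str)
    By = B.y.values.astype(str)
    pieces = []
    for _, a in A.iterrows():
        mask = (ca.find(Bx, a.x) >= 0) & (ca.find(By, a.y) >= 0)
        matched = B[mask].copy()
        if len(matched):
            matched['A_x'] = a.x
            matched['A_y'] = a.y
            pieces.append(matched)
    return pd.concat(pieces) if pieces else pd.DataFrame()

That's O(len(A) * len(B)) string scans, so it will be slow on two very large frames, but it implements exactly the semantics you described.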

See also: pandas + dataframe - select by partial string

John Zwinck
  • Thanks, I will give this a shot, but they are both very large files, which is why I'm looking for a more efficient way. – yortos Jan 21 '15 at 06:11
  • If you can't find anything in the standard packages, you should consider writing your own C extension. Then you could implement whatever efficient algorithm you can think of. – John Zwinck Jan 21 '15 at 06:13
0

You could try something like:

B.x.where(B.x.str.contains(A.x), B.index)  # this would give you the ones that don't match

B.x.where(B.x.str.match(A.x, as_indexer=True), B.index)  # this would also give you the ones that don't match. You could see if you can use the "^" anchor used in regex to get the ones that match.

You could also maybe try:

import numpy as np
np.where(B.x.str.contains(A.x), B.index, np.nan)

Also, you can try:

matchingmask = B[B.x.str.contains(A.x)]

matchingframe = B.ix[matchingmask.index]  # or
matchingcolumn = B.ix[matchingmask.index].x  # or
matchingindex = B.ix[matchingmask.index].index

All of these assume you have the same index on both frames (I think)

You want to look at the string methods: http://pandas.pydata.org/pandas-docs/stable/text.html#text-string-methods

You will want to read up on regex and the pandas where method: http://pandas.pydata.org/pandas-docs/dev/indexing.html#the-where-method-and-masking
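
If A has more than one row, one way to give str.contains everything at once is to join A's values into a single regex alternation (a sketch; the variable names are just for illustration, and re.escape guards against regex metacharacters in the data):

import re

# one pattern per column that matches any of A's values as a substring
x_pat = '|'.join(re.escape(s) for s in A.x)
y_pat = '|'.join(re.escape(s) for s in A.y)

# rows of B whose x contains some A.x value and whose y contains some A.y value
candidates = B[B.x.str.contains(x_pat) & B.y.str.contains(y_pat)]

Note that this only says some value from A.x and some value from A.y occur in a given B row, not that they come from the same A row; to build the actual row pairs you still need a loop like the one in the other answer.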

Skorpeo