0

Say I have two dataframes A and B, each containing two columns called x and y. I want to join these two dataframes, not on rows where the x and y columns are equal across the two dataframes, but on rows where A's x column is a substring of B's x column, and the same for y. For example,

if A[x][1]='mpla' and B[x][1]='mplampla'

I would want that to be captured.
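
To make it concrete, the two frames might look something like this (made-up values):

import pandas as pd

A = pd.DataFrame({'x': ['foo', 'mpla'], 'y': ['zz', 'ab']})
B = pd.DataFrame({'x': ['bar', 'mplampla'], 'y': ['qq', 'abab']})

# A['x'][1] = 'mpla' is a substring of B['x'][1] = 'mplampla', and
# A['y'][1] = 'ab' is a substring of B['y'][1] = 'abab', so that pair
# of rows should appear in the join.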

In SQL it would be something like:

select *
from A
join B
on A.x <= B.x and A.y <= B.y

Can something like this be done in Python?

yortos
  • The SQL you gave does not implement what you described. Which one do you want: substring matching, or lexicographical comparison? – John Zwinck Jan 21 '15 at 05:57
  • Oh I know, it was just an example. On this specific occasion I need substring matching. But ideally I would like to learn if there is a way to implement it for other relationships too (for example, if I had numbers, etc.) – yortos Jan 21 '15 at 06:08
  • Here is a similar question... http://stackoverflow.com/questions/22723286/complex-joins-in-pandas – DataByDavid Jan 22 '15 at 03:27

2 Answers

0

You can match a single string at a time against all the strings in one column, like this:

import numpy.core.defchararray as ca

ca.find(B.x.values.astype(str), 'mpla') >= 0
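# find() gives the position of 'mpla' within each string of B.x (-1 if absent),
# so ">= 0" marks the rows of B.x that contain 'mpla'.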

The problem with that is you'll have to loop over all elements of A. But if you can afford that, it should work.
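
For instance, a rough sketch of that brute-force loop (the A_x/A_y columns in the result are just illustrative names I made up):

import numpy.core.defchararray as ca
import pandas as pd

def substring_join(A, B):
    # For each row of A, mark the rows of B whose x and y contain
    # A's x and y as substrings, then stack the matches.
    Bx = B.x.values.astype(str)
    By = B.y.values.astype(str)
    pieces = []
    for _, a in A.iterrows():
        mask = (ca.find(Bx, a.x) >= 0) & (ca.find(By, a.y) >= 0)
        matched = B[mask].copy()
        if len(matched):
            matched['A_x'] = a.x
            matched['A_y'] = a.y
            pieces.append(matched)
    return pd.concat(pieces) if pieces else pd.DataFrame()

That's O(len(A) * len(B)) string scans, so it will be slow on two very large frames, but it implements exactly the semantics you described.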

See also: pandas + dataframe - select by partial string

John Zwinck
  • Thanks, I will give this a shot, but they are both very large files, which is why I'm looking for a more efficient way. – yortos Jan 21 '15 at 06:11
  • If you can't find anything in the standard packages, you should consider writing your own C extension. Then you could implement whatever efficient algorithm you can think of. – John Zwinck Jan 21 '15 at 06:13
0

You could try something like:

B.x.where(B.x.str.contains(A.x), B.index)  # this would give you the ones that don't match

B.x.where(B.x.str.match(A.x, as_indexer=True), B.index)  # this would also give you the ones that don't match. You could see if you can use the "^" anchor used in regex to get the ones that match.

You could also maybe try:

import numpy as np
np.where(B.x.str.contains(A.x), B.index, np.nan)

Also, you can try:

matchingmask = B[B.x.str.contains(A.x)]

matchingframe = B.ix[matchingmask.index]  # or
matchingcolumn = B.ix[matchingmask.index].x  # or
matchingindex = B.ix[matchingmask.index].index

All of these assume you have the same index on both frames (I think)

You want to look at the string methods: http://pandas.pydata.org/pandas-docs/stable/text.html#text-string-methods

You will want to read up on regex and the pandas where method: http://pandas.pydata.org/pandas-docs/dev/indexing.html#the-where-method-and-masking
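
If A has more than one row, one way to give str.contains everything at once is to join A's values into a single regex alternation (a sketch; the variable names are just for illustration, and re.escape guards against regex metacharacters in the data):

import re

# one pattern per column that matches any of A's values as a substring
x_pat = '|'.join(re.escape(s) for s in A.x)
y_pat = '|'.join(re.escape(s) for s in A.y)

# rows of B whose x contains some A.x value and whose y contains some A.y value
candidates = B[B.x.str.contains(x_pat) & B.y.str.contains(y_pat)]

Note that this only says some value from A.x and some value from A.y occur in a given B row, not that they come from the same A row; to build the actual row pairs you still need a loop like the one in the other answer.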

Skorpeo