
I've seen discussions of partial string match merges here and elsewhere, but nothing on how to do it on the table containing the superstring instead of the substring.

Given sample data like

df1 = pd.DataFrame({'uri': ['http://www.foo.com/index', 
                            'https://bar.net/directory', 
                            'www.baz.gov/aboutus']})
df2 = pd.DataFrame({'fqdn': ['www.foo.com',
                             'www.qux.mil']})
print(df1)

                         uri
0   http://www.foo.com/index
1  https://bar.net/directory
2        www.baz.gov/aboutus
print(df2)

          fqdn
0  www.foo.com
1  www.qux.mil

my end goal is to remove from df1 any row whose uri contains one of the strings in df2. In the actual data, df1 has several million rows and df2 has a few hundred, and any given row in df1 will match at most one string in df2.

Given the sample data, I hope to end up with a dataframe like

                         uri
0  https://bar.net/directory
1        www.baz.gov/aboutus

and by my logic, the intermediate step is to generate

                         uri           fqdn
0   http://www.foo.com/index    www.foo.com
1  https://bar.net/directory            NaN
2        www.baz.gov/aboutus            NaN

but I can't figure out an efficient way to check all of df2's values inside of df1.apply().
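For illustration, the naive row-wise version would look something like this sketch (it works on the sample, but it loops over all of df2 in pure Python for every row, which I doubt scales):

import numpy as np

# For each uri, take the first fqdn from df2 that appears in it, or NaN if none does
df1['fqdn'] = df1['uri'].apply(
    lambda uri: next((f for f in df2['fqdn'] if f in uri), np.nan)
)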


Edit:

While both answers below work, I got the fastest results in my specific scenario by compiling a regex object and using str.extract:

import re

# str.extract needs a capturing group, hence the parentheses around the alternatives
fqdn_pattern = re.compile(f"({'|'.join(df2.fqdn)})")

df1['fqdn'] = df1.uri.str.extract(fqdn_pattern)
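From there, dropping the matched rows is just a null check on the extracted column, something like:

# Keep rows with no match, drop the helper column, and renumber the index
df1 = df1[df1['fqdn'].isna()].drop(columns='fqdn').reset_index(drop=True)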
– Andy

2 Answers


Is this what you need? str.findall:

df1.uri.str.findall(df2.fqdn.str.cat(sep='|')).str[0]
Out[192]: 
0    www.foo.com
1            NaN
2            NaN
Name: uri, dtype: object
#df1['fqdn']=df1.uri.str.findall(df2.fqdn.str.cat(sep='|')).str[0]
– BENY
  • +1 This works perfectly on the toy example, but I don't know if it scales to the size of my set or if I need something with short-circuiting. I'm getting ahead of myself. I'll accept once I can test in the morning. – Andy Aug 09 '19 at 00:52

Concatenate the strings from df2 with the pipe (|) as separator, which is the or operator in regex. This way we can check whether each row of df1 contains any of these strings, and remove the matching rows with str.contains and ~, the not operator:

m = ~df1['uri'].str.contains('|'.join(df2['fqdn']))
df1[m]

Output

                         uri
1  https://bar.net/directory
2        www.baz.gov/aboutus
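One caveat with this approach, and a possible refinement: the joined values are interpreted as a regex, so the '.' in each fqdn is a wildcard matching any character. For a strictly literal match, escaping each value first should work:

import re

# re.escape makes '.' match literally instead of acting as a regex wildcard
m = ~df1['uri'].str.contains('|'.join(map(re.escape, df2['fqdn'])))
df1[m]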
– Erfan
  • While this accomplishes what I need without the middle step when tested on the toy set, when applied to even a subset of the real data it never finishes, whereas WeNYoBen's approach runs almost instantly. Is that weird? – Andy Aug 09 '19 at 15:16
  • Yes, it was weird, and it was because of a typo. – Andy Aug 09 '19 at 15:31
  • Upon further testing using samples of up to 1M strings, your solution is approximately 20% faster. – Andy Aug 09 '19 at 15:48
  • That's nice to hear. Glad I could help. @Andy – Erfan Aug 09 '19 at 16:18
  • I managed to get a significant speed boost (~25%) by compiling a regex object, check the bottom of my question if you're interested. – Andy Aug 09 '19 at 18:59