Find matching substrings in two lists

Question

I have two lists, A and B. They have different lengths and both contain strings. What is the best way to find the substrings that the two lists have in common?

list_A = ['hello','there','you','are']
list_B = ['say_hellaa','therefore','foursquare']

I would like a list of matching substrings called list_C which contains:

list_C = ['hell','there','are']

I came across this answer, but it requires me to have a list of matching substrings. Is there a way I can get what I want without manually creating a list of matching substrings?

This also does not help me, because the second list contains substrings.

`Also suggest ways to implement` Probably not intentional, but when you write like this, it comes off as quite rude, makes people less inclined to help. — SuperStew, Jun 28 '18 at 14:19
The most performant solution depends very much on how long the two lists are in relation to each other and how long the strings and patterns are on average. — user2390182, Jun 28 '18 at 14:22
@schwobaseggl I actually have a pandas column of about 500,000 rows and there are about 100 unique strings in the column. — sudnyank, Jun 28 '18 at 14:27
@sudnyank, Does the dup help you? If not, can you clarify why not? — jpp, Jun 28 '18 at 14:29
@jpp It does not. Because the second list contains substrings of the original. I specifically asked for a way that does not require me to create a list of matching substrings. — sudnyank, Jun 28 '18 at 15:28

score 3 · Answer 1 · answered Jun 28 '18 at 14:20

This is one approach, using a list comprehension.

list_A = ['hello','there','you','are']
list_B = ['hell','is','here']
jVal = "|".join(list_A)        # hello|there|you|are

print([i for i in list_B if i in jVal ])

Output:

['hell', 'here']
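
One caveat with the joined string: a candidate that itself contains the `|` separator could match across two neighbouring entries. A minimal sketch of the same containment check without the join, assuming the sample lists from this answer (the helper name `contained_in_any` is made up for illustration):

```python
def contained_in_any(candidates, strings):
    """Return the candidates that occur as a substring of at least one string."""
    return [c for c in candidates if any(c in s for s in strings)]


list_A = ['hello', 'there', 'you', 'are']
list_B = ['hell', 'is', 'here']

print(contained_in_any(list_B, list_A))  # ['hell', 'here']
```

This tests every candidate against every string, so it is probably only a reasonable fit when both lists are fairly small.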

answered Jun 28 '18 at 14:20

Rakesh


score 2 · Answer 2 · answered Jun 28 '18 at 14:21

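Since you tagged pandas, here is a solution using str.contains:

import pandas as pd

S_A = pd.Series(list_A)
S_B = pd.Series(list_B)

# Keep each entry of S_B that occurs as a substring of any entry of S_A.
# regex=False treats each entry as plain text rather than as a pattern.
S_B[S_B.apply(lambda x: S_A.str.contains(x, regex=False)).any(axis=1)]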
Out[441]: 
0    hell
2    here
dtype: object

answered Jun 28 '18 at 14:21

BENY


score 1 · Answer 3 · answered Jun 28 '18 at 14:24

If I understand correctly, I'd use NumPy:

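import numpy as np

a = np.array(['hello', 'there', 'you', 'are', 'up', 'date'])
b = np.array(['hell', 'is', 'here', 'update'])

# np.char.find returns -1 where the substring does not occur.
# bina: entries of b found inside some entry of a
bina = b[np.where(np.char.find(a[:, None], b) > -1)[1]]
# ainb: entries of a found inside some entry of b
ainb = a[np.where(np.char.find(b, a[:, None]) > -1)[0]]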

np.append(bina, ainb)

array(['hell', 'here', 'up', 'date'], dtype='<U6')

edited Jun 28 '18 at 14:47

answered Jun 28 '18 at 14:24

piRSquared


Any reason [your answer here](https://stackoverflow.com/a/51085122/9209546) wouldn't work? (Feel free to reopen if you think there's a material difference in this question.) – jpp Jun 28 '18 at 14:27
Yeah, unless I misunderstood. OP wanted a two way check. Otherwise, I'm deleting. Actually, I didn't do that right either. Fixing. – piRSquared Jun 28 '18 at 14:28

score 0 · Answer 4 · edited Jun 28 '18 at 14:27

list_A = ['hello','there','you','are']
list_B = ['hell','is','here']
list_C = []

# Two-way check: record whichever string of each pair is contained in the other.
for a in list_A:
    for b in list_B:
        if a in b:
            list_C.append(a)
        if b in a:
            list_C.append(b)

print(list_C)  # ['hell', 'here']
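
Whole-string containment like this does not reproduce the output asked for in the question, where 'hell' is shared by 'hello' and 'say_hellaa' without either string containing the other. A rough sketch of one way to get there, using the longest common substring of each pair via the standard library's `difflib.SequenceMatcher`. The helper names and the `min_len=3` cutoff are assumptions made for illustration, not something the question specifies:

```python
from difflib import SequenceMatcher


def longest_common_substring(a, b):
    """Return the longest contiguous block shared by strings a and b."""
    matcher = SequenceMatcher(None, a, b, autojunk=False)
    match = matcher.find_longest_match(0, len(a), 0, len(b))
    return a[match.a:match.a + match.size]


def common_substrings(strings_a, strings_b, min_len=3):
    """Collect the longest common substring of each pair, if long enough.

    min_len is an assumed cutoff that drops incidental overlaps such as 'he'.
    """
    found = []
    for a in strings_a:
        for b in strings_b:
            sub = longest_common_substring(a, b)
            if len(sub) >= min_len and sub not in found:
                found.append(sub)
    return found


list_A = ['hello', 'there', 'you', 'are']
list_B = ['say_hellaa', 'therefore', 'foursquare']

print(common_substrings(list_A, list_B))  # ['hell', 'there', 'are']
```

Every pair is compared, so for a large column it would likely pay to reduce the data to its unique strings first.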

edited Jun 28 '18 at 14:27

Bugs


answered Jun 28 '18 at 14:26

Lyux

Please refrain from using offensive language in your posts. Thank you. – Bugs Jun 28 '18 at 14:27

score 0 · Answer 5 · answered Jun 28 '18 at 15:36

For funsies, here's an answer that uses regex!

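import re

# Sample lists used in the other answers
list_A = ['hello', 'there', 'you', 'are']
list_B = ['hell', 'is', 'here']

text = ' '.join(list_A)  # join once, outside the loop
matches = []
for pat in list_B:
    # re.escape makes each entry match literally, even if it contains regex metacharacters
    matches.append(re.search(re.escape(pat), text))
matches = [mat.group() for mat in matches if mat]
print(matches)
# ['hell', 'here']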

re.search returns a match object when the pattern is found, and the matched text is available through the object's group() method. When no match is found (as is the case for the second element in your list_B), it returns None, hence the if mat filter at the end of the list comprehension.
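
Because the list entries are used as regex patterns, any entry that contains a metacharacter such as `.` would be interpreted as a pattern rather than as plain text. A small sketch of the difference, assuming literal matching is what is wanted (`literal_matches` is a hypothetical helper name):

```python
import re


def literal_matches(patterns, strings):
    """Return the patterns found literally in at least one of the strings."""
    found = []
    for pat in patterns:
        regex = re.compile(re.escape(pat))  # treat pat as plain text
        if any(regex.search(s) for s in strings):
            found.append(pat)
    return found


print(literal_matches(['hell', 'is', 'here'], ['hello', 'there', 'you', 'are']))
# ['hell', 'here']

# Unescaped, '.' matches any character, so 'h.re' matches inside 'there'.
print(bool(re.search('h.re', 'there')))      # True
print(literal_matches(['h.re'], ['there']))  # []
```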
