4

I have two lists, A and B, of different lengths; both contain strings. What is the best way to find the substrings that the two lists have in common?

list_A = ['hello','there','you','are']
list_B = ['say_hellaa','therefore','foursquare']

I would like a list of matching substrings called list_C which contains:

list_C = ['hell','there','are']

I came across this answer, but it requires me to have a list of matching substrings. Is there a way I can get what I want without manually creating a list of matching substrings?

This also does not help me, because the second list contains substrings.

sudnyank
  • `Also suggest ways to implement` Probably not intentional, but when you write like this, it comes off as quite rude and makes people less inclined to help. – SuperStew Jun 28 '18 at 14:19
  • The most performant solution depends very much on how long the two lists are in relation to each other and how long the strings and patterns are on average. – user2390182 Jun 28 '18 at 14:22
  • @SuperStew Apologies, not intended. Will edit it out. – sudnyank Jun 28 '18 at 14:24
  • A duplicate was asked 5 minutes earlier! – jpp Jun 28 '18 at 14:26
  • @schwobaseggl I actually have a pandas column of about 500,000 rows and there are about 100 unique strings in the column. – sudnyank Jun 28 '18 at 14:27
  • @sudnyank, Does the dup help you? If not, can you clarify why not? – jpp Jun 28 '18 at 14:29
  • @jpp It does not. Because the second list contains substrings of the original. I specifically asked for a way that does not require me to create a list of matching substrings. – sudnyank Jun 28 '18 at 15:28

5 Answers

3

Here is one approach, using a list comprehension:

list_A = ['hello','there','you','are']
list_B = ['hell','is','here']
jVal = "|".join(list_A)        # hello|there|you|are

print([i for i in list_B if i in jVal])

Output:

['hell', 'here']
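
One caveat with the joined string: a candidate that happens to span the "|" separator would also be reported as a match. A minimal sketch that checks each string of list_A separately is below; the name matches and the probe string 'o|t' are only illustrative and not part of the original answer.

```python
list_A = ['hello', 'there', 'you', 'are']
list_B = ['hell', 'is', 'here']

# Keep each candidate from list_B that occurs inside at least one string of list_A.
matches = [b for b in list_B if any(b in a for a in list_A)]
print(matches)

# The joined string can match across the separator; per-string checks cannot.
joined = "|".join(list_A)
spans_separator = 'o|t' in joined                 # found only because of the join
in_single_string = any('o|t' in a for a in list_A)
print(spans_separator, in_single_string)
```

For the lists above both approaches give the same result; the difference only shows up when a candidate contains the separator character.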
Rakesh
2

Since you tagged pandas, here is a solution using str.contains:

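import pandas as pd

S_A = pd.Series(list_A)
S_B = pd.Series(list_B)

# Keep each element of S_B that is a substring of at least one element of S_A
S_B[S_B.apply(lambda x: S_A.str.contains(x, regex=False)).any(axis=1)]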
Out[441]: 
0    hell
2    here
dtype: object
BENY
1

If I understand correctly, I'd use NumPy:

import numpy as np

a = np.array(['hello', 'there', 'you', 'are', 'up', 'date'])
b = np.array(['hell', 'is', 'here', 'update'])

# np.char.find is the public name for numpy.core.defchararray.find
# elements of b found inside some element of a
bina = b[np.where(np.char.find(a[:, None], b) > -1)[1]]
# elements of a found inside some element of b
ainb = a[np.where(np.char.find(b, a[:, None]) > -1)[0]]

np.append(bina, ainb)

array(['hell', 'here', 'up', 'date'], dtype='<U6')
piRSquared
  • Any reason [your answer here](https://stackoverflow.com/a/51085122/9209546) wouldn't work? (Feel free to reopen if you think there's a material difference in this question.) – jpp Jun 28 '18 at 14:27
  • Yeah, unless I misunderstood. OP wanted a two way check. Otherwise, I'm deleting. Actually, I didn't do that right either. Fixing. – piRSquared Jun 28 '18 at 14:28
0

list_A = ['hello','there','you','are']
list_B = ['hell','is','here']
list_C = []

# Compare every pair and keep whichever string is contained in the other
for a in list_A:
    for b in list_B:
        if a in b:
            list_C.append(a)
        elif b in a:
            list_C.append(b)

print(list_C)
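
The nested loop only detects the case where one whole string sits inside the other. For the lists in the question, where the strings merely share a piece ('hello' and 'say_hellaa' share 'hell'), a rough sketch using the standard library's difflib.SequenceMatcher might look like the following. The helper names and the min_len threshold of 3 are assumptions made for illustration; the question does not say how short a shared piece may be.

```python
from difflib import SequenceMatcher

def longest_common_substring(a, b):
    # autojunk=False disables difflib's popularity heuristic for long inputs.
    matcher = SequenceMatcher(None, a, b, autojunk=False)
    m = matcher.find_longest_match(0, len(a), 0, len(b))
    return a[m.a:m.a + m.size]

def common_substrings(list_a, list_b, min_len=3):
    # For each string in list_a, keep its longest overlap with any string
    # in list_b, provided the overlap has at least min_len characters.
    result = []
    for a in list_a:
        best = max((longest_common_substring(a, b) for b in list_b),
                   key=len, default="")
        if len(best) >= min_len:
            result.append(best)
    return result

list_A = ['hello', 'there', 'you', 'are']
list_B = ['say_hellaa', 'therefore', 'foursquare']
list_C = common_substrings(list_A, list_B)
print(list_C)
# ['hell', 'there', 'are']
```

This compares every pair of strings, so for a large column with few distinct values it is probably worth running it on the unique values only.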
Bugs
Lyux
0

For funsies, here's an answer that uses regex!

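import re

# Join list_A once, then search for each entry of list_B in it.
# re.escape makes each entry match literally, even if it contains regex metacharacters.
haystack = ' '.join(list_A)
matches = []
for pat in list_B:
    matches.append(re.search(re.escape(pat), haystack))
matches = [mat.group() for mat in matches if mat]
print(matches)
# ['hell', 'here']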

re.search returns a match object when the pattern is found, and match.group() gives the matched string. When no match is found (as is the case for the second element in your list_B), re.search returns None, hence the if mat filter at the end of the list comprehension.
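
One thing to watch for: each entry of list_B is interpreted as a regular expression, so an entry containing metacharacters can match text it does not literally appear in. The sketch below contrasts the raw and escaped forms; the entry 'a.e' is a made-up example added only to show the difference.

```python
import re

list_A = ['hello', 'there', 'you', 'are']
list_B = ['hell', 'is', 'here', 'a.e']  # 'a.e' is a hypothetical entry

haystack = ' '.join(list_A)

# Raw patterns: '.' matches any character, so 'a.e' matches 'are'.
unescaped = [pat for pat in list_B if re.search(pat, haystack)]

# Escaped patterns: every entry is matched literally.
escaped = [pat for pat in list_B if re.search(re.escape(pat), haystack)]

print(unescaped)
print(escaped)
```

If the entries are meant as plain substrings rather than patterns, the escaped form is likely the safer choice.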

Engineero