-1

So I have two huge lists of strings and I am trying to compare them using Python 3. List 1 has about 300k values and list 2 has about a million values. List 1 looks like that: ["aaa","bbb","ccc","ddd"] List 2 looks like that : ["aaa 1234 asd dsa","hjk lkj 1234","ccc 5678 aaa"]

What is the best way to find if each element from list 1 is a part of each element of list 2? The thing is that there might be more than one element from list 1 that is a part of element in list 2. Also there might be none as well.

If I try a nested loop it takes forever to go through the lists. Is there a better way?

kir.pir
  • 77
  • 6
  • 1
    Does this answer your question? https://stackoverflow.com/questions/15147751/how-to-check-if-all-items-in-a-list-are-there-in-another-list – Suryansu Dash Jul 26 '20 at 10:36
  • 1
    Please show the code you're currently using. The lists don't seem large enough to cause too much of a problem. It may be that you're just using an inefficient algorithm. – ekhumoro Jul 26 '20 at 10:40
  • Do you want to find the duplicate elements or just yes-no? – Sai Sreenivas Jul 26 '20 at 10:43
  • 3
    @SuryansuDash The elements in List 1 are *substrings* of the elements in List 2. So that other question doesn't really help much. – ekhumoro Jul 26 '20 at 10:47
  • Do you want any substrings or *words*? Said differently would `"aa"` be an element of `"aaa 1234 asd dsa"` – Serge Ballesta Jul 26 '20 at 11:02
  • For your example case can you illustrate the desired output and the code (although it may be inefficient) for generating? The comments and answers illustrate there is confusion on the output. – DarrylG Jul 26 '20 at 11:11

2 Answers2

1

Your problem is that you're looking for substring matching on the elements, which will always be expensive as you have to loop over the second list no matter what your solution.

Sets are the best way for comparing data, but you need each element as a unique item. An example of how to do this here would be:

list1 = ["aaa","bbb","ccc","ddd"]
list2 = ["aaa 1234 asd dsa","hjk lkj 1234","ccc 5678 aaa"]
set1 = set(list1)
for item in list2:
    if set1.issubset(set(item.split()):
        # the current item contains every value in list1

issubset() checks whether all the contents of the calling instance are inside the value passed to it. There are also additional methods very checking the intersection, difference, etc...

Of course, this solution assumes you're looking for unique space-separated values in the list2 items. If you're allowing partial matches, eg, "aaa" matching "aaab", then you'd have to do substring matching which will be slow. What is the nature of the problem you're trying to solve? This feels like a manual attempt at a database query for which there are much better solutions.

MattRickS
  • 41
  • 3
  • Have you tested your code with 300k items in `list1` and a million items in `list2`? The problem here isn't *how* to compare the elements - it's finding a way to do it efficiently. – ekhumoro Jul 26 '20 at 10:58
  • When comparing substrings for every element in a list you have no quick option. Substring matches are very slow. Set comparisons are fast but require separate elements. This is why it would be better to know the problem space, if the data can be provided in a better format it can be optimised better, otherwise the solutions are limited. – MattRickS Jul 26 '20 at 11:02
  • Only a tiny optimization, but I think that converting `list1` to a set is a waste of time, unless it contains a number of duplicates. – Serge Ballesta Jul 26 '20 at 11:05
  • @MattRickS Why don't you just test your code and see how long it takes to run? If it runs significantly faster than the OPs current solution, who cares about hypothetical "better" solutions? Practicality beats purity... – ekhumoro Jul 26 '20 at 11:14
  • @ekhumoro - If he provided any profiling that would be an option but we don't have anything to compare against. – MattRickS Jul 26 '20 at 11:20
  • @MattRickS You don't necessarily need comparisons - just do some basic testing with [timeit](https://docs.python.org/3/library/timeit.html#module-timeit) and put the results in your answer. If it demonstrates reasonable performance (in absolute terms), that's probably all that's needed. – ekhumoro Jul 26 '20 at 11:48
0

Will the match in list2 always be a whole word? If so, you could turn list2 into a set or dict, then go through list1 just once.

Something like:

list2_map = collections.defaultdict(list)
for item in list2:
  for word in item.split()
    list2_map[word].append(item)

for item in list1:
  if item in list2_map:
    print("%s appears in %s" % (item, ', '.join(list2_map[item])))

Because list2_map is a dict, looking up items is very fast; apart from that you're looping through each list once, which you have to do no matter the solution.

If you don't need to know which items of list2, you can replace the defaultdict with a collections.Counter (to just get a count) or a set (to just get a yes/no).

Jiří Baum
  • 6,697
  • 2
  • 17
  • 17