4

Let's say I have a string and a list of strings:

a = 'ABCDEFG'

b = ['ABC', 'QRS', 'AHQ']

How can I pull out the string in list b that matches up perfectly with a section of the string a? So the would return would be something like ['ABC']

The most important issue is that I have tens of millions of strings, so that time efficiency is essential.

Nix
  • 57,072
  • 29
  • 149
  • 198
  • Do you want only the first match or a list of all matches? – mgilson Jan 17 '13 at 22:12
  • 1
    Just because you have tens of millions of strings doesn't mean that the obvious answer is too slow. Don't just assume it's slow—time it and see. Show us the code, tell us how long it takes, and tell us that's too long, or people will keep posting answers telling you what you already know. – abarnert Jan 17 '13 at 22:12
  • Is this going to be a console app ? Or do you have a server that you are going to run this code on ? – Nix Jan 17 '13 at 22:13
  • If you _do_ need to optimize this: Is `b` a big list of strings that you only need to run once apiece against a single `a`? (In other words, is preprocessing `a` the obvious right thing to do, or not?) – abarnert Jan 17 '13 at 22:13
  • I have 50,000,000 stings and a list of strings that is 6000 strings long. – user1843553 Jan 17 '13 at 22:23
  • I have tried several way of doing this, none mentioned here so far. I appreciate the new ideas. – user1843553 Jan 17 '13 at 22:28
  • @user1843553: Where do those 50M and 6K come in? Are you looping over 50M `a` values and using the same 6K `b` values each time? Or vice-versa? Or… something else? – abarnert Jan 17 '13 at 22:30

5 Answers5

4

If you only want the first match in b:

next((s for s in b if s in a), None)

This has the advantage of short-circuiting as soon as it finds a match whereas the other list solutions will keep going. If no match is found, it will return None.

mgilson
  • 300,191
  • 65
  • 633
  • 696
  • Very nice & quick solution. I am not sure it is Pythonic (hey, you don't have a space before `None`!) ;) – Tadeck Jan 17 '13 at 22:37
  • @mglison: how about if you want repeat the matching list in a for example in the above case..ABCABCDEFG – Praneeth Jun 22 '15 at 20:07
1

Keep in mind that Python's substring search x in a is already optimized pretty well for the general case (and coded in C, for CPython), so you're unlikely to beat it in general, especially with pure Python code.

However, if you have a more specialized case, you can do much better.

For example, if you have an arbitrary list of millions of strings b that all need to be searched for within one giant static string a that never changes, preprocessing a can make a huge difference. (Note that this is the opposite of the usual case, where preprocessing the patterns is the key.)

On the other hand, if you expect matches to be unlikely, and you've got the whole b list in advance, you can probably get some large gains by organizing b in some way. For example, there's no point searching for "ABCD" if "ABC" already failed; if you need to search both "ABC" and "ABD" you can search for "AB" first and then check whether it's followed by "C" or "D" so you don't have to repeat yourself; etc. (It might even be possible to merge all of b into a single regular expression that's close enough to optimal… although with millions of elements, that probably isn't the answer.)

But it's hard to guess in advance, with no more information than you've given us, exactly what algorithm you want.

Wikipedia has a pretty good high-level overview of string searching algorithms. There's also a website devoted to pattern matching in general, which seems to be a bit out of date, but then I doubt you're going to turn out to need an algorithm invented in the past 3 years anyway.

abarnert
  • 354,177
  • 51
  • 601
  • 671
0

Answer:

(x for x in b if x in a )

That will return a generator that will be a list of ones that match. Take the first or loop over it.

Nix
  • 57,072
  • 29
  • 149
  • 198
  • 2
    If he wants more, he needs to give us context on how he is running it, I would also like to know how he is going to load 10s of millions of strings into an array. – Nix Jan 17 '13 at 22:11
  • 1
    @Nix: He did explicitly say "The most important issue is that I have tens of millions of strings, so that time efficiency is essential." – abarnert Jan 17 '13 at 22:14
  • 1
    @Nix: The time to search for `x in a` is orders of magnitude higher than the time to just iterate each `x`. And I don't see how pointing out that you completely ignored what the OP called "the most important issue" counts as trolling. – abarnert Jan 17 '13 at 22:20
  • Thats not true. B is fixed length, A is variable length. So worst case we are talking 0N vs 0MN. – Nix Jan 17 '13 at 22:23
  • @Nix: I'm not even sure what you're arguing here. You're completely right that the time to do `x for x in b for x in a` is `O(AB)`; the time to do `x for x in b` is `O(B)` Which means that "the time to iterate the list" is not the issue here. Unless you're saying that "quadratic is not worse than linear", or "I know for a fact that A << B even though I don't know what the problem is"? – abarnert Jan 17 '13 at 22:28
  • @Nix: Since you keep deleting and editing your comments as soon as I respond to them, I can't even make sense of this discussion anymore. – abarnert Jan 17 '13 at 22:31
  • You just commented on your self. Not deleting anything, maybe you should go back and add more details to your answer. – Nix Jan 17 '13 at 22:33
  • @Nix: You had a comment between my first two comments. Your (now) second comment said "0AB vs 0B" but then you changed it to "0N vs 0MN" after I replied, and removed a bunch of extra argument. It doesn't make much sense to deny what you did on a site that retains history. – abarnert Jan 18 '13 at 19:28
0
In [3]: [s for s in b if s in a]
Out[3]: ['ABC']

On my machine this takes about 3 seconds when b contains 20,000,000 elements (tested with a and b containing strings similar to those in the question).

NPE
  • 486,780
  • 108
  • 951
  • 1,012
0

You might want to have a look at the following algorithm:

Boyer–Moore string search algorithm And wikipedia

But without knowing more, this might be overkill!

beoliver
  • 5,579
  • 5
  • 36
  • 72