
I have the following list:

data_items = ['abc','123data','dataxyz','456','344','666','777','888','888', 'abc', 'xyz']

And I have a list of search items:

search = ['abc','123','xyz','456']

I want to iterate over data_items, look for matches against the search list, and build a basic structure that holds a count for each match, e.g.

counts = ['abc':'2', '123':'1', 'xyz':'2'.........]

What's the best way to do this?

user1513388
  • See related: http://stackoverflow.com/questions/2600191/how-can-i-count-the-occurrences-of-a-list-item-in-python/2600208#2600208 – EdChum Apr 28 '14 at 13:50

3 Answers


You could use re.search and a collections.Counter, e.g.:

import re
from collections import Counter

data_items = ['abc','123data','dataxyz','456','344','666','777','888','888', 'abc', 'xyz']
search = ['abc','123','xyz','456']

to_search = re.compile('|'.join(sorted(search, key=len, reverse=True)))
matches = (to_search.search(el) for el in data_items)
counts = Counter(match.group() for match in matches if match)
# Counter({'abc': 2, 'xyz': 2, '123': 1, '456': 1})
Jon Clements
  • This code iterates `data_items` once per element in `search`, i.e. it might perform poorly if `data_items` is very long and there's more than one element in `search`. Or maybe Python does some sort of loop fusion? :-) – Frerich Raabe Apr 28 '14 at 14:18
  • @FrerichRaabe It does a search only once rather than checking each `search` for each `data_item`... The regex engine can possibly optimise the `|`'d candidates (but I'm not 100% sure on that) - at the very least it shouldn't really be any less performant than a nested loop and break on match structure... – Jon Clements Apr 28 '14 at 14:24
  • Oops, you're right of course, I accidentally commented on the wrong answer (my remark was meant to be to GammaAmino's answer)! – Frerich Raabe Apr 28 '14 at 14:28
  • Now that I re-read your code, I think the `to_search` part should make sure to escape special (as far as the regular expression engine goes) characters, i.e. it should read something like `re.compile('|'.join(re.escape(s) for s in sorted(search, key=len, reverse=True)))` – Frerich Raabe Apr 28 '14 at 14:30
  • @FrerichRaabe it should indeed... Also, it's probably worth noting it will give different results than xbb's answer which will count all possible matches per each string rather than stopping on a first match... Which is the correct behaviour I'm not sure of :) – Jon Clements Apr 28 '14 at 14:34
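To illustrate the escaping point from the comments above, here is a minimal sketch. The search term 'a.c', the sample data, and the helper name count_first_matches are made up for illustration; they are not from the question. Without re.escape, the '.' acts as a wildcard and 'abc' gets counted too.

```python
import re
from collections import Counter

# Hypothetical data: the term 'a.c' contains a regex metacharacter.
data_items = ['a.c', 'abc', 'xa.cx', '123data']
search = ['a.c', '123']

def count_first_matches(items, terms, escape=True):
    """Count the first (leftmost) search term found in each item."""
    ordered = sorted(terms, key=len, reverse=True)  # try longer alternatives first
    if escape:
        ordered = [re.escape(t) for t in ordered]   # treat terms as literal text
    pattern = re.compile('|'.join(ordered))
    matches = (pattern.search(el) for el in items)
    return Counter(m.group() for m in matches if m)

escaped = count_first_matches(data_items, search)
unescaped = count_first_matches(data_items, search, escape=False)
print(escaped)    # 'abc' is not counted
print(unescaped)  # 'abc' is counted because '.' matches any character
```

Note that this still counts at most one match per item, as discussed in the last comment.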

It looks like you need partial matches too. The code below is straightforward but may not be efficient, and it assumes you're OK with a dict as the result.

>>> data_items = ['abc','123data','dataxyz','456','344','666','777','888','888', 'abc', 'xyz']
>>> search = ['abc','123','xyz','456']
>>> result = {k:0 for k in search}
>>> for item in data_items:
...     for search_item in search:
...         if search_item in item:
...             result[search_item] += 1
...
>>> result
{'123': 1, 'abc': 2, 'xyz': 2, '456': 1}
xbb
  • +1 this is what I would have started with (except that maybe a [`collections.Counter`](https://docs.python.org/2/library/collections.html#collections.Counter) instead of a plain dictionary would have been a bit nicer). – Frerich Raabe Apr 28 '14 at 14:20
  • @FrerichRaabe I'll also just add that if using a normal `dict` and since `int`s are immutable, that `result = dict.fromkeys(search, 0)` is arguably more readable, efficient and backwards compatible... – Jon Clements Apr 28 '14 at 14:27
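The two suggestions in the comments above could look roughly like this. It is only a sketch using the question's data; the variable name term is my own.

```python
from collections import Counter

data_items = ['abc', '123data', 'dataxyz', '456', '344', '666',
              '777', '888', '888', 'abc', 'xyz']
search = ['abc', '123', 'xyz', '456']

# Plain dict pre-filled with zeros via dict.fromkeys.
result = dict.fromkeys(search, 0)
for item in data_items:
    for term in search:
        if term in item:          # substring test, so '123data' matches '123'
            result[term] += 1

# Counter variant: terms that never match are simply absent from the result.
counts = Counter(term for item in data_items
                 for term in search if term in item)

print(result)
print(counts)
```

Unlike the regex answer, an item containing two search terms (say, a hypothetical 'abc123') would increment both 'abc' and '123' here.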
counts = {}
for s in search:
    lower_s = s.lower()
    # list.count() only counts items that equal lower_s exactly
    counts[lower_s] = str(data_items.count(lower_s))

That's if you are ok with using a dictionary (since you said structure, it's a better choice).

carefullynamed
  • Thanks, this seems to work, but I need to convert my data_items to lowercase so the match works. Where would I do this? I tried using .lower() but I'm not sure where to apply it. – user1513388 Apr 28 '14 at 14:05
  • This code won't yield the expected results given by the OP, having `123data` in `data_items` should count as a match if `search` contains just `123`. – Frerich Raabe Apr 28 '14 at 14:16
  • True, I didn't notice that in the OP. – carefullynamed Apr 28 '14 at 14:33
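For the lowercase question and the partial-match point raised in the comments, one possible approach is to lowercase both sides at comparison time and use a substring test instead of list.count(). This is a sketch only; the mixed-case sample data is made up, and the counts are kept as ints rather than strings.

```python
# Hypothetical mixed-case data, not from the question.
data_items = ['ABC', '123Data', 'dataXYZ', '456', 'abc']
search = ['abc', '123', 'XYZ', '456']

counts = {}
for s in search:
    lower_s = s.lower()
    # Lowercase each item before the substring test; sum() adds up the True values.
    counts[lower_s] = sum(lower_s in item.lower() for item in data_items)

print(counts)
```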