
I have a Python dictionary whose values are lists of terms:

myDict = {
    'ID_1': ['(dog|cat[a-z+]|horse)', '(car[a-z]+|house|apple\w)', '(bird|tree|panda)'],
    'ID_2': ['(horse|building|computer)', '(panda\w|lion)'],
    'ID_3': ['(wagon|tiger|cat\w*)'],
    'ID_4': ['(dog)']
    }

I want to read the list items in each value as individual regular expressions. If they match any text, the matched text should be returned as keys in a separate dictionary, with the original keys (the IDs) as the values.

So if these terms were read as regexes for searching this string:

"dog panda cat cats pandas car carts"

The general approach I have in mind is something like:

for key, value in myDict.items():
    for item in value:
        if re.compile(item) matches something in the text:
            newDict[match] = [list of keys]

The expected output would be:

newDict = {
    'car': ['ID_1'],
    'carts': ['ID_1'],
    'dog': ['ID_1', 'ID_4'],
    'panda': ['ID_1', 'ID_2'],
    'pandas': ['ID_1', 'ID_2'],
    'cat': ['ID_1', 'ID_3'],
    'cats': ['ID_1', 'ID_3']
    }

The matched text should be returned as a key in newDict only if it has actually matched something in the body of text. So in the output, 'carts' is listed because a regex in ID_1's values matched it, and therefore that ID is listed in the output dict.

outis
Silent-J
  • can you provide an example with the expected output ? – scharette Oct 26 '17 at 17:44
  • @scharette newDict is what I'm hoping to achieve as an output. – Silent-J Oct 26 '17 at 17:45
  • **To provide more context - the values of myDict contain a list of RegExes. They're being run against a collection of texts and in the end, only the matches of these RegExes should be returned. Sorry for the confusion and not providing more info in the question but thanks to everyone who has provided answers already. But unfortunately this isn't something that can be done with simple string formatting. It NEEDS to be done by running these terms as regexes.** – Silent-J Oct 26 '17 at 18:00
  • Why isn't cars or apples in the `newDict` output? – Andy Hayden Oct 26 '17 at 18:04
  • @AndyHayden I've provided more info in the question. – Silent-J Oct 26 '17 at 18:15
  • @J_Micks. Did you see my answer? I think it should do what you want. – ekhumoro Oct 26 '17 at 18:23

2 Answers


Here's a simple script that seems to fit your requirements:

import re
from collections import defaultdict

text = """
the eye of the tiger
a dog in the manger
the cat in the hat
a kingdom for my horse
a bird in the hand
"""

myDict = {
    'ID_1': ['(dog|cat|horse)', '(car|house|apples)', '(bird|tree|panda)'],
    'ID_2': ['(horse|building|computer)', '(panda|lion)'],
    'ID_3': ['(wagon|tiger|cat)'],
    'ID_4': ['(dog)'],
    }

newDict = defaultdict(list)

for key, values in myDict.items():
    for pattern in values:
        for match in re.finditer(pattern, text):
            newDict[match.group(0)].append(key)

for item in newDict.items():
    print(item)

Output:

('dog', ['ID_1', 'ID_4'])
('cat', ['ID_1', 'ID_3'])
('horse', ['ID_1', 'ID_2'])
('bird', ['ID_1'])
('tiger', ['ID_3'])
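
Note that re.finditer matches substrings, so a pattern such as (dog|cat|horse) would also find "cat" inside "cats". If only whole-word matches are wanted, one option is to test each word with re.fullmatch. The following is a minimal sketch under that assumption: the helper name match_whole_words, the whitespace tokenisation, and the simplified patterns using \w* are illustrative and not taken from the question.

```python
import re
from collections import defaultdict


def match_whole_words(patterns_by_id, text):
    """Map each whole word in text to the IDs whose patterns match it fully."""
    result = defaultdict(list)
    # Assumption: words in the text are separated by whitespace.
    words = text.split()
    for key, patterns in patterns_by_id.items():
        for pattern in patterns:
            regex = re.compile(pattern)
            for word in words:
                # fullmatch succeeds only if the pattern covers the whole word;
                # the membership test avoids listing the same ID twice.
                if regex.fullmatch(word) and key not in result[word]:
                    result[word].append(key)
    return dict(result)


# Hypothetical patterns, simplified from the ones in the question.
my_dict = {
    'ID_1': [r'(dog|cat\w*|horse)', r'(car\w*|house)', r'(bird|tree|panda\w*)'],
    'ID_2': [r'(horse|building|computer)', r'(panda\w*|lion)'],
    'ID_3': [r'(wagon|tiger|cat\w*)'],
    'ID_4': [r'(dog)'],
}

new_dict = match_whole_words(my_dict, "dog panda cat cats pandas car carts")
for item in new_dict.items():
    print(item)
```

With these sample patterns, 'cat' and 'cats' end up as separate keys, each listing ID_1 and ID_3, which is the shape of output described in the question.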
ekhumoro
  • This worked perfectly. Thanks so much for the quick reply. I modified it a bit so that instead of only getting the first instance of the pattern in text I got ALL instances of the pattern in the text by inserting: "if match is not None: for g in match: screen = re.search(pattern, g) newDict[screen.group(0)].append(key). – Silent-J Oct 26 '17 at 18:40
  • @J_Micks. I did wonder about that, but it wasn't clear from your question. I have amended my answer so that it gets all matches for each pattern. – ekhumoro Oct 26 '17 at 18:46
  • @ekhumoro: Out of curiosity: could this be done with a dict comprehension? – Jan Oct 26 '17 at 18:50
  • @Jan. Not really. Several patterns can match the same thing, so the output dict needs to be continually updated as new matches are found. A dictcomp would overwrite any previous matches. I suppose it could be done using side-effects on a separate dict - but I would say that doesn't really count as a dictcomp. – ekhumoro Oct 26 '17 at 18:59
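
The overwriting behaviour described in the last comment can be illustrated with a small sketch. The sample text and the two patterns are made up for the illustration; only re.finditer and defaultdict come from the answer.

```python
import re
from collections import defaultdict

# Hypothetical sample data for the illustration.
text = "a dog and a cat"
patterns = {'ID_1': '(dog|cat)', 'ID_4': '(dog)'}

# In a dict comprehension, a later ID replaces an earlier one
# whenever both patterns match the same text.
overwritten = {
    match.group(0): [key]
    for key, pattern in patterns.items()
    for match in re.finditer(pattern, text)
}

# Appending to a defaultdict keeps every ID for each matched text.
grouped = defaultdict(list)
for key, pattern in patterns.items():
    for match in re.finditer(pattern, text):
        grouped[match.group(0)].append(key)

print(overwritten)    # {'dog': ['ID_4'], 'cat': ['ID_1']}
print(dict(grouped))  # {'dog': ['ID_1', 'ID_4'], 'cat': ['ID_1']}
```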

One way is to convert the regexes into plain lists of terms, e.g. with string manipulation:

In [11]: {id_: "|".join(ls).replace("(", "").replace(")", "").split("|") for id_, ls in myDict.items()}
Out[11]:
{'ID_1': ['dog',
  'cat',
  'horse',
  'car',
  'house',
  'apples',
  'bird',
  'tree',
  'panda'],
 'ID_2': ['horse', 'building', 'computer', 'panda', 'lion'],
 'ID_3': ['wagon', 'tiger', 'cat'],
 'ID_4': ['dog']}
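
This string manipulation only yields usable terms when every pattern is a plain alternation of literal words. As a rough sketch (the helper name split_alternatives is hypothetical), the same transformation written as a function shows what happens to a pattern that contains other regex syntax:

```python
def split_alternatives(patterns):
    """Strip parentheses and split on '|', as in the expression above."""
    return "|".join(patterns).replace("(", "").replace(")", "").split("|")


literal = split_alternatives(['(wagon|tiger|cat)'])
with_regex = split_alternatives([r'(wagon|tiger|cat\w*)'])

print(literal)     # ['wagon', 'tiger', 'cat']
print(with_regex)  # ['wagon', 'tiger', 'cat\\w*']
```

The last term in the second list is still a regex fragment rather than a word, so it would have to be run as a regex against the text.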

You can make this into a DataFrame:

In [12]: import pandas as pd; from collections import Counter

In [13]: pd.DataFrame({id_:Counter( "|".join(ls).replace("(", "").replace(")", "").split("|") ) for id_, ls in myDict.items()}).fillna(0).astype(int)
Out[13]:
          ID_1  ID_2  ID_3  ID_4
apples       1     0     0     0
bird         1     0     0     0
building     0     1     0     0
car          1     0     0     0
cat          1     0     1     0
computer     0     1     0     0
dog          1     0     0     1
horse        1     1     0     0
house        1     0     0     0
lion         0     1     0     0
panda        1     1     0     0
tiger        0     0     1     0
tree         1     0     0     0
wagon        0     0     1     0
Andy Hayden
  • hey Andy, but the items within the lists need to be searched across some bodies of texts, and only if they end up matching anything within the text should they be returned with their IDs they were originally linked with. I'm really sorry I didn't provide that information as vital sooner and thanks so much for taking the time to respond! – Silent-J Oct 26 '17 at 18:01
  • @J_Micks please update your question with an example regex. Why is there a list of regex (it just has to match one from the list)? This question is not particularly clear. – Andy Hayden Oct 26 '17 at 18:03