5

I have a string and a dictionary, I have to replace every occurrence of the dict key in that text.

text = 'I have a smartphone and a Smart TV'
dict = {
    'smartphone': 'toy',
    'smart tv': 'junk'
}

If there is no space in keys, I will break the text into word and compare one by one with dict. Look like it took O(n). But now the key have space inside it so thing is more complected. Please suggest me the good way to do this and please notice the key may not match case with the text.

Update

I have think of this solution but it not efficient. O(m*n) or more...

for k,v in dict.iteritems():
    text = text.replace(k,v) #or regex...
James
  • 13,571
  • 6
  • 61
  • 83

5 Answers5

2

If the key word in the text is not close to each others (keyword other keyword) we may do this. Took O(n) to me >"<

def dict_replace(dictionary, text, strip_chars=None, replace_func=None):
    """
        Replace word or word phrase in text with keyword in dictionary.

        Arguments:
            dictionary: dict with key:value, key should be in lower case
            text: string to replace
            strip_chars: string contain character to be strip out of each word
            replace_func: function if exist will transform final replacement.
                          Must have 2 params as key and value

        Return:
            string

        Example:
            my_dict = {
                "hello": "hallo",
                "hallo": "hello",    # Only one pass, don't worry
                "smart tv": "http://google.com?q=smart+tv"
            }
            dict_replace(my_dict, "hello google smart tv",
                         replace_func=lambda k,v: '[%s](%s)'%(k,v))
    """

    # First break word phrase in dictionary into single word
    dictionary = dictionary.copy()
    for key in dictionary.keys():
        if ' ' in key:
            key_parts = key.split()
            for part in key_parts:
                # Mark single word with False
                if part not in dictionary:
                    dictionary[part] = False

    # Break text into words and compare one by one
    result = []
    words = text.split()
    words.append('')
    last_match = ''     # Last keyword (lower) match
    original = ''       # Last match in original
    for word in words:
        key_word = word.lower().strip(strip_chars) if \
                   strip_chars is not None else word.lower()
        if key_word in dictionary:
            last_match = last_match + ' ' + key_word if \
                         last_match != '' else key_word
            original = original + ' ' + word if \
                       original != '' else word
        else:
            if last_match != '':
                # If match whole word
                if last_match in dictionary and dictionary[last_match] != False:
                    if replace_func is not None:
                        result.append(replace_func(original, dictionary[last_match]))
                    else:
                        result.append(dictionary[last_match])
                else:
                    # Only match partial of keyword
                    match_parts = last_match.split(' ')
                    match_original = original.split(' ')
                    for i in xrange(0, len(match_parts)):
                        if match_parts[i] in dictionary and \
                           dictionary[match_parts[i]] != False:
                            if replace_func is not None:
                                result.append(replace_func(match_original[i], dictionary[match_parts[i]]))
                            else:
                                result.append(dictionary[match_parts[i]])
            result.append(word)
            last_match = ''
            original = ''

    return ' '.join(result)
James
  • 13,571
  • 6
  • 61
  • 83
1

If your keys have no spaces:

output = [dct[i] if i in dct else i for i in text.split()]

' '.join(output)

You should use dct instead of dict so it doesn't collide with the built in function dict()

This makes use of a dictionary comprehension, and a ternary operator to filter the data.

If your keys do have spaces, you are correct:

for k,v in dct.iteritems():
    string.replace('d', dct[d])

And yes, this time complexity will be m*n, as you have to iterate through the string every time for each key in dct.

mindink
  • 25
  • 4
  • Key has space, so you cannot split – James Feb 02 '16 at 02:01
  • string replace will fail if dict have something like this my_dict={"google": "yahoo", "yahoo": "google"} and text "google is bigger than yahoo" – James Feb 04 '16 at 19:23
0

Drop all dictionary keys and the input text to lower case, so the comparisons are easy. Now ...

for entry in my_dict:
    if entry in text:
        # process the match

This assumes that the dictionary is small enough to warrant the match. If, instead, the dictionary is large and the text is small, you'll need to take each word, then each 2-word phrase, and see whether they're in the dictionary.

Is that enough to get you going?

Prune
  • 76,765
  • 14
  • 60
  • 81
  • The dict may have 3 words, 4 words... who knows. And your algorithm is not efficient. – James Feb 02 '16 at 02:04
  • I believe it's **O(n)** for bounded number of words. If it's limited only by the input length, then it's **O(n^2)** -- but given punctuation to break up the phrases in the input, **n** is rather limited, too. Is this tractable for your application? – Prune Feb 02 '16 at 02:09
  • if entry in text took more than O(n) to compare and for entry in my dict took another O(m) so would be O(n*m) – James Feb 02 '16 at 17:30
0

You need to test all the neighbor permutations from 1 (each individual word) to len(text) (the entire string). You can generate the neighbor permutations this way:

text = 'I have a smartphone and a Smart TV'

array = text.lower().split()

key_permutations = [" ".join(array[j:j + i]) for i in range(1, len(array) + 1) for j in range(0, len(array) - (i - 1))]

>>> key_permutations
['i', 'have', 'a', 'smartphone', 'and', 'a', 'smart', 'tv', 'i have', 'have a', 'a smartphone', 'smartphone and', 'and a', 'a smart', 'smart tv', 'i have a', 'have a smartphone', 'a smartphone and', 'smartphone and a', 'and a smart', 'a smart tv', 'i have a smartphone', 'have a smartphone and', 'a smartphone and a', 'smartphone and a smart', 'and a smart tv', 'i have a smartphone and', 'have a smartphone and a', 'a smartphone and a smart', 'smartphone and a smart tv', 'i have a smartphone and a', 'have a smartphone and a smart', 'a smartphone and a smart tv', 'i have a smartphone and a smart', 'have a smartphone and a smart tv', 'i have a smartphone and a smart tv']

Now we substitute through the dictionary:

import re

for permutation in key_permutations:
    if permutation in dict:
        text = re.sub(re.escape(permutation), dict[permutation], text, flags=re.IGNORECASE)

>>> text
'I have a toy and a junk'

Though you'll likely want to try the permutations in the reverse order, longest first, so more specific phrases have precedence over individual words.

cdlane
  • 40,441
  • 5
  • 32
  • 81
0

You can do this pretty easily with regular expressions.

import re

text = 'I have a smartphone and a Smart TV'
dict = {
    'smartphone': 'toy',
    'smart tv': 'junk'
}

for k, v in dict.iteritems():
    regex = re.compile(re.escape(k), flags=re.I)
    text = regex.sub(v, text)

It still suffers from the problem of depending on processing order of the dict keys, if the replacement value for one item is part of the search term for another item.

Brendan Abel
  • 35,343
  • 14
  • 88
  • 118