So I may have a string 'Bank of China', or 'Embassy of China', and 'International China'

I want to replace all country instances except when we have an 'of ' or 'of the '

Clearly this can be done by iterating through a list of countries, checking if the name contains a country, then checking if before the country 'of ' or 'of the ' exists.

If either prefix is present we keep the country; otherwise we remove it. The examples will become:

'Bank of China', or 'Embassy of China', and 'International'

However, iteration can be slow, particularly when you have a large list of countries and a large list of texts for replacement.

Is there a faster, conditional way of doing the replacement, so that I can still use a simple pattern match with the Python re library?

My function is along these lines:

def removeCountry(name):
    for country in countries:
        if country in name:
            if 'of ' + country in name:
                return name
            if 'of the ' + country in name:
                return name
            else:
                name =  re.sub(country + '$', '', name).strip()
                return name
    return name

EDIT: I did find some info here. This does describe how to do an if, but what I really want is: if not 'of ' and not 'of the ', then replace...

redrubia
  • Is there a reason you need to accomplish both the conditional and substitution in one single regular expression? Or, is this an abstract example? – Cuadue Feb 13 '14 at 23:07

4 Answers

I think you could use the approach in Python: how to determine if a list of words exist in a string to find any countries mentioned, then do further processing from there.

Something like

countries = [
    "Afghanistan",
    "Albania",
    "Algeria",
    "Andorra",
    "Angola",
    "Anguilla",
    "Antigua",
    "Arabia",
    "Argentina",
    "Armenia",
    "Aruba",
    "Australia",
    "Austria",
    "Azerbaijan",
    "Bahamas",
    "Bahrain",
    "China",
    "Russia"
    # etc
]

def find_words_from_set_in_string(set_):
    set_ = set(set_)
    def words_in_string(s):
        return set_.intersection(s.split())
    return words_in_string

get_countries = find_words_from_set_in_string(countries)

then

get_countries("The Embassy of China in Argentina is down the street from the Consulate of Russia")

returns

set(['Argentina', 'China', 'Russia'])

... which obviously needs more post-processing, but very quickly tells you exactly what you need to look for.

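As pointed out in the linked article, you must be wary of words ending in punctuation - which could be handled by splitting on a character class, e.g. re.split(r"[\s,.!?;:'\"]+", s) (str.split itself treats its argument as one literal separator, not as a set of characters). You may also want to look for adjectival forms, i.e. "Russian", "Chinese", etc.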
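A minimal sketch of the punctuation-aware lookup described above. The names `COUNTRIES` and `countries_in` are made up for illustration, and the three-entry set stands in for a real country list:

```python
import re

# Hypothetical sample set; a real list would be much longer.
COUNTRIES = {"Argentina", "China", "Russia"}

def countries_in(text, countries=COUNTRIES):
    """Return the country names that appear as whole words in text."""
    # \w+ keeps runs of letters, digits and underscores, so commas,
    # periods and quotes around a word are dropped from each token.
    words = set(re.findall(r"\w+", text))
    return words & set(countries)

print(sorted(countries_in('Visit "China", then Russia.')))
# → ['China', 'Russia']
```

Because the text is split into single words, multi-word names such as "New Zealand" would not be found this way and would need separate handling.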

Hugh Bothwell

You could compile a few sets of regular expressions, then pass your list of input through them. Something like:

import re

countries = ['foo', 'bar', 'baz']
takes = [re.compile(r'of\s+(the)?\s*%s$' % (c), re.I) for c in countries]
subs = [re.compile(r'%s$' % (c), re.I) for c in countries]

def remove_country(s):
    for regex in takes:
        if regex.search(s):
            return s
    for regex in subs:
        s = regex.sub('', s)
    return s

print(remove_country('the bank of foo'))
print(remove_country('the bank of the baz'))
print(remove_country('the nation bar'))

''' Output:
    the bank of foo
    the bank of the baz
    the nation
'''

It doesn't look like anything faster than linear time complexity is possible here. At least you can avoid recompiling the regular expressions a million times and improve the constant factor.
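One possible way to shrink the constant factor further, assuming the country list is known up front, is to fold all names into a single alternation so each input string is scanned by two compiled patterns instead of two per country. This is only a sketch; `keep_re`, `strip_re` and `remove_country_combined` are illustrative names, not part of the answer above:

```python
import re

countries = ['foo', 'bar', 'baz']  # placeholder names, as in the answer above

# One alternation covering every name; re.escape protects names that
# contain regex metacharacters.
alternation = '|'.join(re.escape(c) for c in countries)

# Matches "of <country>" or "of the <country>" at the end of the string.
keep_re = re.compile(r'\bof\s+(?:the\s+)?(?:%s)$' % alternation, re.I)
# Matches a trailing country name together with the whitespace before it.
strip_re = re.compile(r'\s*\b(?:%s)$' % alternation, re.I)

def remove_country_combined(s):
    """Drop a trailing country unless it follows 'of' or 'of the'."""
    if keep_re.search(s):
        return s
    return strip_re.sub('', s)

print(remove_country_combined('the bank of foo'))
print(remove_country_combined('the nation bar'))
```

The work per string is still linear in its length, but the Python-level loop over countries disappears.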

Edit: I had a few typos, but the basic idea is sound and it works. I've added an example.

Cuadue
  • Thank you! This looks like a good solution. Also I can speed it up by initially checking if a country even exists in s – redrubia Feb 13 '14 at 23:16
  • @redrubia Are you sure this solution is okay? I just tested it and it looks like it doesn't return the correct result: `remove_country('Embassy of China')` results in `''` (empty string). Instead of `regex.match(s)` it should be `regex.find(s)`. (And `regex.replace` should be `regex.sub`). – Bakuriu Feb 13 '14 at 23:20
  • Pardon: `re.search(s)`. – Bakuriu Feb 13 '14 at 23:26
  • yeah this is incorrect answer, just tried it too and seems it produces a null string – redrubia Feb 14 '14 at 00:02
  • Couple of typos, yes. I guess you're looking for a plug and play solution... Anyway it works and I've posted copy-pasteable code for you. – Cuadue Feb 14 '14 at 00:16

The re.sub function accepts a function as replacement text, which is called in order to get the text that should be substituted in the given match. So you could do this:

import re

def make_regex(countries):
    escaped = (re.escape(country) for country in countries)
    states = '|'.join(escaped)
    return re.compile(r'\s+(of(\sthe)?\s)?(?P<state>{})'.format(states))

def remove_name(match):
    name = match.group()
    if name.lstrip().startswith('of'):
        return name
    else:
        return name.replace(match.group('state'), '').strip()

regex = make_regex(['China', 'Italy', 'America'])
regex.sub(remove_name, 'Embassy of China, International Italy').strip()
# result: 'Embassy of China, International'

The result might contain some spurious spaces (in the above case a final strip() is needed). You can fix this by modifying the regex to:

\s*(of(\sthe)?\s)?(?P<state>({}))

This catches the spaces before "of" or before the country name and avoids the bad spacing in the output.
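As a rough illustration of that variant, the sketch below plugs the modified pattern into a copy of the helper functions from this answer. `make_regex_v2` is a name invented here; `remove_name` is repeated only so the block runs on its own:

```python
import re

def make_regex_v2(countries):
    # Same construction as make_regex, but with \s* so that leading
    # whitespace is optional and becomes part of the match.
    states = '|'.join(re.escape(country) for country in countries)
    return re.compile(r'\s*(of(\sthe)?\s)?(?P<state>({}))'.format(states))

def remove_name(match):
    name = match.group()
    if name.lstrip().startswith('of'):
        return name  # "of X" / "of the X": leave untouched
    return name.replace(match.group('state'), '').strip()

regex = make_regex_v2(['China', 'Italy', 'America'])
result = regex.sub(remove_name, 'Embassy of China, International Italy')
print(repr(result))
# → 'Embassy of China, International'
```

Since the whitespace in front of a removed country is consumed by the match, no trailing space is left behind for this input. Other inputs, for example a country at the very start of a sentence, may still need extra cleanup.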

Note that this solution can handle a whole text, not just text of the form Something of Country and Something Country. For example:

In [38]: regex = make_regex(['China'])
    ...: text = '''This is more complex than just "Embassy of China" and "International China"'''

In [39]: regex.sub(remove_name, text)
Out[39]: 'This is more complex than just "Embassy of China" and "International"'

Another usage example:

In [33]: countries = [
    ...:     'China', 'India', 'Denmark', 'New York', 'Guatemala', 'Sudan',
    ...:     'France', 'Italy', 'Australia', 'New Zealand', 'Brazil', 
    ...:     'Canada', 'Japan', 'Vietnam', 'Middle-Earth', 'Russia',
    ...:     'Spain', 'Portugal', 'Argentina', 'San Marino'
    ...: ]

In [34]: template = 'Embassy of {0}, International {0}, Language of {0} is {0}, Government of {0}, {0} capital, Something {0} and something of the {0}.'

In [35]: text = 100 * '\n'.join(template.format(c) for c in countries)

In [36]: regex = make_regex(countries)
    ...: result = regex.sub(remove_name, text)

In [37]: result[:150]
Out[37]: 'Embassy of China, International, Language of China is, Government of China, capital, Something and something of the China.\nEmbassy of India, Internati'
Bakuriu
  • seems this solution has errors in Python 2.7 where it comes up: "sorry, but this version only supports 100 named groups" – redrubia Feb 14 '14 at 00:05
  • @redrubia Fixed. However I believe the message of the error is bogus. The regex always had 1 *named* group, but many non-named groups. – Bakuriu Feb 14 '14 at 06:55

Not tested:

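def removeCountry(name):
    for country in countries:
        # Python's re only allows fixed-width lookbehinds, so 'of ' and 'of the ' get one assertion each
        name = re.sub('(?<!of )(?<!of the )' + re.escape(country) + '$', '', name).strip()
    return name

Using negative lookbehinds, re.sub matches and replaces only when the country is not preceded by 'of ' or 'of the '.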
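The lookbehind idea can also be applied to all countries in one pass. The sketch below is an assumption-laden variant, not the answer's own code: Python's re module requires each lookbehind to have a fixed width, so it chains two separate assertions, and it joins a small illustrative country list into one alternation. `TRAILING_COUNTRY` and `remove_country_lookbehind` are hypothetical names:

```python
import re

countries = ['China', 'Italy', 'America']  # illustrative list

# Two fixed-width negative lookbehinds, one per forbidden prefix,
# followed by any country name at the end of the string.
TRAILING_COUNTRY = re.compile(
    r'(?<!of )(?<!of the )\b(?:%s)$' % '|'.join(map(re.escape, countries))
)

def remove_country_lookbehind(name):
    """Remove a trailing country unless it follows 'of ' or 'of the '."""
    return TRAILING_COUNTRY.sub('', name).strip()

for s in ['Bank of China', 'Embassy of China', 'International China']:
    print(remove_country_lookbehind(s))
# → Bank of China
# → Embassy of China
# → International
```

The pattern is compiled once, so each input string costs a single regex scan rather than one per country.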

rbernabe