I'm trying to use GeoText to generate country mentions, but cities like Rio de Janeiro and Rio das Ostras are not recognized. I checked the dictionary, and those cities are in there.

  • Input:

from geotext import GeoText

text = "Rio de Janeiro, Las Vegas, New York"
geo = GeoText(text)
print(geo.cities)

  • Output:

    • ['Las Vegas', 'New York']
  • Expected Output:

    • ['Rio de Janeiro','Las Vegas','New York']

Using Python 3.x and geotext 0.3.0.

Wolfgang Fahl
  • Please provide a working example of what you want to do. It will be much easier for someone to help you that way. – Anoroah Jun 13 '18 at 17:05

1 Answer

The regex in the GitHub repo differs from the one in the latest pip-installed version (0.3.0).

In[2]: import re
In[3]: text = "Rio de Janeiro, Las Vegas, New York"

# old regex (pip installed)
In[4]: city_regex = r"[A-Z]+[a-zà-ú]*(?:[ '-][A-Z]+[a-zà-ú]*)*"
In[5]: re.findall(city_regex, text)
Out[5]: ['Rio', 'Janeiro', 'Las Vegas', 'New York']

# new regex (GitHub)
In[6]: city_regex = r"[A-ZÀ-Ú]+[a-zà-ú]+[ \-]?(?:d[a-u].)?(?:[A-ZÀ-Ú]+[a-zà-ú]+)*"
In[7]: re.findall(city_regex, text)
Out[7]: ['Rio de Janeiro', 'Las Vegas', 'New York']

The GitHub repo's regex seems to work fine even for three-word cities, but it isn't being used in the latest version on PyPI.
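For anyone who wants to reproduce the comparison outside an interactive session, here is a minimal sketch as a standalone script. The two patterns are copied from the session above; `extract_candidates` and the constant names are hypothetical, made up for this illustration, and are not part of geotext. The script only shows how each pattern splits the text into candidate names. It does not check which pattern your installed copy actually uses.

```python
import re

# Patterns copied from the session above. The constant names are
# illustrative only and do not come from the geotext source.
PIP_030_PATTERN = r"[A-Z]+[a-zà-ú]*(?:[ '-][A-Z]+[a-zà-ú]*)*"
GITHUB_PATTERN = r"[A-ZÀ-Ú]+[a-zà-ú]+[ \-]?(?:d[a-u].)?(?:[A-ZÀ-Ú]+[a-zà-ú]+)*"


def extract_candidates(text, pattern):
    """Return every non-overlapping match of pattern in text (hypothetical helper)."""
    return re.findall(pattern, text)


text = "Rio de Janeiro, Las Vegas, New York"

old_candidates = extract_candidates(text, PIP_030_PATTERN)
new_candidates = extract_candidates(text, GITHUB_PATTERN)

print(old_candidates)  # ['Rio', 'Janeiro', 'Las Vegas', 'New York']
print(new_candidates)  # ['Rio de Janeiro', 'Las Vegas', 'New York']
```

With the older pattern, the lowercase "de" ends the match after "Rio", so "Rio de Janeiro" never reaches the city lookup as a single candidate.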

G_M