I created a function that combs strings for mentions of countries. It is based on a .txt file that contains the many different ways people mention a country in text. The file looks like this:

"afghanistan": ["afghan", "afghans"], "albania": ["albanian", "albanians"], "algeria": ["algerian", "algerians"], "angola": ["angolan", "angolans"], ... and so on, for every country on earth.

I then created a function that combs the string and searches for the mentions, but it runs a bit slow on large datasets. I really want to make the function run faster, but I don't know how.

The function looks like this:



import json
import string
from re import sub
from typing import List, Union

def find_countries(text: str, exclude: Union[str, List[str]] = [], extra: Union[str, List[str]] = []) -> Union[List[str], str]:
    """
    Parameters
    ----------
    `text` : `str`
        The text to extract countries from.
    `exclude` : `list or str`
        Optional. Countries to exclude from search.
    `extra` : `list or str`
        Optional. Additional terms to search for (usually orgs).
    """

    # Load country names from file
    with open('country_names.txt') as file:
        country_names = json.load(file)

    # Convert 'exclude' and 'extra' to lists
    exclude = [exclude] if isinstance(exclude, str) else exclude
    extra = [extra] if isinstance(extra, str) else extra

    # Include 'extra' countries or orgs
    for i in extra:
        country_names[i.lower()] = []

    # Remove 'exclude' countries using set operations
    exclude_set = set(exclude)
    countries = {country for country in country_names.keys() if country.lower() not in exclude_set}

    # Clean and preprocess the input text
    my_punct = string.punctuation + '”“'
    replace_punct_string = "['’-]"
    text = sub(replace_punct_string, " ", text)
    text = text.translate(str.maketrans('', '', my_punct)).lower()

    # Search for country mentions using a set comprehension
    countries_mentioned = {country for country in countries
                           if any(f' {name} ' in f' {text} ' for name in {country} | set(country_names[country]))}

    return list(countries_mentioned)

The function receives a string and combs it for mentions of countries, which it then returns as a list of countries. I usually apply it to a Pandas Series.

I think the code as it is now is "fine": it isn't long and it does the job. I hope you can help me make it run faster, so that when I apply it to tens of thousands of texts it won't take years to finish. Also, any tips on writing better code will help a lot!

GuyBecker
  • Please trim your code to make it easier to find your problem. Follow these guidelines to create a [minimal reproducible example](https://stackoverflow.com/help/minimal-reproducible-example). – Community Aug 16 '23 at 14:15

1 Answer

You do a lot of on-the-fly conversion that seems completely unnecessary to me. If you only use set functionality, you should provide the data as sets in the first place. If I'm seeing this correctly, you don't need the ordering of a list, so pass sets as arguments rather than lists. That way you can drop all the conversion code inside the function.

Additionally, if the file is not too large and you call the function many times, you can save a lot of time by loading the data only once and keeping it in memory, instead of reloading it on every call inside the function. You could, for example, create a data structure that loads the data on first access and caches it to prevent reloads. The @property decorator is well suited for such use cases.
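As a minimal sketch of the load-once idea, `functools.lru_cache` can stand in for a hand-written cache. It assumes the file is a valid JSON object; the function name `load_country_names` and its `path` parameter are hypothetical and not part of the original code.

```python
import json
from functools import lru_cache


@lru_cache(maxsize=None)
def load_country_names(path: str = "country_names.txt") -> dict:
    """Read and parse the file once per path; later calls reuse the cached dict."""
    with open(path, encoding="utf-8") as file:
        return json.load(file)
```

The cached dictionary is shared between calls, so callers should treat it as read-only. Adding `extra` entries to it, as the original function does, would leak into later calls.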

I would also create a dictionary which maps the variants to the correct value. Something like

{
    "afghan": "afghanistan",
    "afghans": "afghanistan",
    # ...
}

With this you can save / outsource one loop in your function.
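A rough sketch of how such a flipped dictionary could be built and used, assuming the file content has already been loaded into a dict. The helper names `invert_variants` and `find_single_word_mentions` are made up for illustration.

```python
def invert_variants(country_names: dict[str, list[str]]) -> dict[str, str]:
    """Map every country name and every variant to its canonical country key."""
    lookup: dict[str, str] = {}
    for country, variants in country_names.items():
        lookup[country] = country
        for variant in variants:
            lookup[variant] = country
    return lookup


def find_single_word_mentions(text: str, lookup: dict[str, str]) -> set[str]:
    """Return the countries whose names appear as whole whitespace-separated tokens."""
    return {lookup[token] for token in text.lower().split() if token in lookup}


sample = {"afghanistan": ["afghan", "afghans"], "albania": ["albanian", "albanians"]}
lookup = invert_variants(sample)
print(find_single_word_mentions("two afghans met an albanian", lookup))
```

Each token costs one dictionary lookup, so the work no longer grows with the number of countries. The sketch only handles single-word names and expects punctuation to be stripped beforehand; multi-word names would need extra handling.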

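One warning, though: you should almost never use an empty list as a default value in the argument list. The default object is created once, when the function is defined, and is then shared by every call, so any mutation leaks from one call into the next. Use None as the default and create the list inside the function instead. This is explained in more detail in an SO post on the topic.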
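A small demonstration of the pitfall, using two hypothetical functions that are unrelated to the country code:

```python
def add_tag_shared(tag, tags=[]):
    # The default list is created once, when the function is defined,
    # so every call that omits `tags` appends to the same object.
    tags.append(tag)
    return tags


def add_tag_fresh(tag, tags=None):
    # None acts as a sentinel; each call that omits `tags` gets a new list.
    if tags is None:
        tags = []
    tags.append(tag)
    return tags


shared_first = add_tag_shared("a")
shared_second = add_tag_shared("b")
print(shared_second)  # ['a', 'b']

fresh_first = add_tag_fresh("a")
fresh_second = add_tag_fresh("b")
print(fresh_second)   # ['b']
```

The function in the question never mutates `exclude` or `extra`, so it happens to be safe, but the `None` default avoids the trap if the code changes later.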

Edit

Actually, this "flipped" dictionary is not helpful. As you mentioned, there was also a problem with matching subwords, e.g. Oman in woman. You can prevent this, and possibly even speed things up a bit, by using regex with word boundaries (I don't actually know about the speed; I didn't run a performance test).

import json
from typing import Optional

import regex


def _compile_pattern(country: str, variants: list[str]) -> regex.Pattern:
    # Escape every name so that punctuation in a name is matched literally.
    # The \b anchors keep "oman" from matching inside "woman".
    names = "|".join(regex.escape(name) for name in [country, *variants])
    return regex.compile(rf"\b(?:{names})\b", regex.IGNORECASE)


class CountryProvider:
    """Loads the country file on first access and caches the compiled patterns."""

    def __init__(self):
        self._countries: Optional[set[str]] = None
        self._patterns: Optional[dict[str, regex.Pattern]] = None

    def _load_countries(self):
        with open("country_names.txt") as file:
            countries = json.load(file)
        self._patterns = {
            country: _compile_pattern(country, variants)
            for country, variants in countries.items()
        }
        self._countries = set(countries.keys())

    @property
    def countries(self) -> set[str]:
        if self._countries is None:
            self._load_countries()
        return self._countries

    @property
    def patterns(self) -> dict[str, regex.Pattern]:
        if self._patterns is None:
            self._load_countries()
        return self._patterns


COUNTRY_PROVIDER = CountryProvider()


def find_countries(
    text: str,
    exclude: Optional[list[str]] = None,
    extra: Optional[dict[str, list[str]]] = None,
) -> list[str]:
    # preprocess text input here if needed
    # replace the None defaults with fresh, empty containers
    excluded = set(exclude or [])
    extra = extra or {}
    countries = COUNTRY_PROVIDER.countries
    patterns = COUNTRY_PROVIDER.patterns
    extra_patterns = {
        country: _compile_pattern(country, variants)
        for country, variants in extra.items()
        if country not in excluded
    }
    mentioned_countries: list[str] = []
    for country in countries:
        if country in excluded:
            continue
        # IGNORECASE is already compiled into the pattern
        if patterns[country].search(text) is not None:
            mentioned_countries.append(country)
    for country, pattern in extra_patterns.items():
        if pattern.search(text) is not None:
            mentioned_countries.append(country)
    return mentioned_countries

Note that the patterns dictionary contains a regex pattern for each country which should match all variants.
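To see the word-boundary behaviour in isolation, here is a small sketch. It uses the standard `re` module instead of `regex` so that it runs without extra packages, and the name list is a hypothetical sample in the same shape as the country file.

```python
import re

# Hypothetical sample entry: a country name plus its variants.
names = ["oman", "omani", "omanis"]

# re.escape guards against names that contain regex metacharacters.
pattern = re.compile(r"\b(?:" + "|".join(map(re.escape, names)) + r")\b", re.IGNORECASE)

print(bool(pattern.search("A woman spoke at the summit.")))   # False
print(bool(pattern.search("The Omani delegation arrived.")))  # True
print("oman" in "a woman spoke")                              # True
```

The last line shows why a plain substring test is not enough: it finds "oman" inside "woman", whereas the `\b` anchors only accept whole words.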

lord_haffi
  • Hi, thank you for your answer. I have a couple of follow-up questions. Why will creating the class save performance time? I haven't worked much with classes, and I would appreciate it if you could help me understand it better. Secondly, you suggest "flipping" the dictionary of countries. How could I accommodate that change in my code? Can you help me with that a little bit? – GuyBecker Aug 16 '23 at 20:05
  • Using this class will save you time if you call the function more than once, because the file is read and the search index is built only once. You could, for example, pass an instance to the function or create an instance as a global variable. It's just one way to implement caching. Additionally, looking at it again with a fresh mind, I think you don't actually need to flip the dictionary. I will edit my post. – lord_haffi Aug 16 '23 at 20:32
  • I tried it out again, but now the problem is that some countries get detected accidentally. For example, in your version of the function, Oman will get detected every time the word "Woman" is in the text. Any idea on maybe how to negate that? – GuyBecker Aug 17 '23 at 17:33
  • Good point. In this case I would suggest using a regex engine. If you care a lot about performance, you could use [regex](https://pypi.org/project/regex/) instead of the standard `re` module. – lord_haffi Aug 17 '23 at 20:38
  • I tried it out and it didn't speed things up much, but I'll keep looking into it in my search for faster runtimes. Thank you! – GuyBecker Aug 18 '23 at 17:57
  • Things you could do to make it faster from this point: you could use multithreading (`regex` releases the GIL, so the searches would run in parallel). You could also write this piece of code in C, Rust, or a similar language; Python is kind of slow sometimes. – lord_haffi Aug 18 '23 at 20:56
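A hedged sketch of the multithreading suggestion, using `concurrent.futures.ThreadPoolExecutor`. The patterns, the sample texts, and the function names are hypothetical. If `regex` is not installed the sketch falls back to `re`, which still runs but is not expected to search in parallel. Whether threads give a real speedup was not measured here.

```python
from concurrent.futures import ThreadPoolExecutor

try:
    import regex as re_engine  # third-party; said above to release the GIL while matching
except ImportError:
    import re as re_engine     # stdlib fallback; works, but threads will not run in parallel

# Hypothetical, precompiled patterns for two countries.
PATTERNS = {
    "oman": re_engine.compile(r"\b(?:oman|omani|omanis)\b", re_engine.IGNORECASE),
    "albania": re_engine.compile(r"\b(?:albania|albanian|albanians)\b", re_engine.IGNORECASE),
}


def find_countries_in(text: str) -> list[str]:
    """Return the sorted names of all countries whose pattern matches the text."""
    return sorted(name for name, pattern in PATTERNS.items() if pattern.search(text))


def find_countries_in_all(texts: list[str], workers: int = 4) -> list[list[str]]:
    """Search many texts on a thread pool; results keep the order of the input."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(find_countries_in, texts))


texts = ["An Albanian woman travelled to Oman.", "No country here."]
print(find_countries_in_all(texts))  # [['albania', 'oman'], []]
```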