I created a function that combs strings for mentions of countries. It is based on a .txt file that contains the many different ways people mention a country in text. The file looks like this:
"afghanistan": ["afghan", "afghans"], "albania": ["albanian", "albanians"], "algeria": ["algerian", "algerians"], "angola": ["angolan", "angolans"],
...
and so on, for every country on earth.
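Since I read the file with `json.load`, it is effectively one JSON object that maps each country to a list of alternative forms. A minimal sketch of that structure, assuming the full file is wrapped in braces (the excerpt above does not show them) and using an inline string instead of the real file:

```python
import json

# Assumed two-country excerpt of country_names.txt.
# The enclosing braces are an assumption; the excerpt above omits them.
raw = '{"afghanistan": ["afghan", "afghans"], "albania": ["albanian", "albanians"]}'

# json.loads turns the text into a dict: country -> list of alternative forms.
country_names = json.loads(raw)

print(country_names["albania"])  # ['albanian', 'albanians']
```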
I then created a function that combs the string and searches for these mentions. It runs slowly on large datasets, and I really want to make it faster, but I don't know how.
The function looks like this:
import json
import string
from re import sub
from typing import List, Union


def find_countries(text: str, exclude: Union[str, List[str]] = [], extra: Union[str, List[str]] = []) -> Union[List[str], str]:
    """
    Parameters
    ----------
    `text` : `str`
        The text to extract countries from.
    `exclude` : `list or str`
        Optional. Countries to exclude from search.
    `extra` : `list or str`
        Optional. Additional terms to search for (usually orgs).
    """
    # Load country names from file
    with open('country_names.txt') as file:
        country_names = json.load(file)

    # Convert 'exclude' and 'extra' to lists
    exclude = [exclude] if isinstance(exclude, str) else exclude
    extra = [extra] if isinstance(extra, str) else extra

    # Include 'extra' countries or orgs
    for i in extra:
        country_names[i.lower()] = []

    # Remove 'exclude' countries using set operations
    exclude_set = set(exclude)
    countries = {country for country in country_names.keys() if country.lower() not in exclude_set}

    # Clean and preprocess the input text
    my_punct = string.punctuation + '”“'
    replace_punct_string = "['’-]"
    text = sub(replace_punct_string, " ", text)
    text = text.translate(str.maketrans('', '', my_punct)).lower()

    # Search for country mentions using a set comprehension
    countries_mentioned = {country for country in countries
                           if any(f' {name} ' in f' {text} ' for name in {country} | set(country_names[country]))}

    return list(countries_mentioned)
The function receives a string, combs it for mentions of countries, and returns the countries it found as a list. I usually apply it to a Pandas Series.
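To make the input and output concrete, here is a minimal, self-contained sketch of the same cleaning and matching steps. It uses a tiny in-memory mapping instead of the file and a plain list of texts instead of a Series; `sample_names` and `mentions` are names made up for this illustration and are not part of my actual code.

```python
import re
import string

# Hypothetical two-country excerpt standing in for country_names.txt.
sample_names = {
    "afghanistan": ["afghan", "afghans"],
    "albania": ["albanian", "albanians"],
}


def mentions(text, names):
    """Return the sorted keys of `names` whose name or any alias occurs as a whole word in `text`."""
    # Same cleaning as find_countries: apostrophes and hyphens become spaces,
    # the remaining punctuation is dropped, and everything is lowercased.
    text = re.sub("['’-]", " ", text)
    text = text.translate(str.maketrans("", "", string.punctuation + "”“")).lower()
    # Padding with spaces makes the substring test behave like a whole-word match.
    padded = f" {text} "
    return sorted(
        country
        for country, aliases in names.items()
        if any(f" {name} " in padded for name in [country, *aliases])
    )


texts = ["The Afghan delegation met Albania's envoy.", "No country here."]
results = [mentions(t, sample_names) for t in texts]
print(results)  # [['afghanistan', 'albania'], []]
```

With the real function and a Series, the equivalent call would be along the lines of `series.apply(find_countries)`, which runs the function once per text.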
I think the code as it is now is "fine": it isn't long and it does the job. I hope you can help me make it run faster, so that when I apply it to tens of thousands of texts it won't take years to finish. Also, any tips on writing better code would help a lot!