I wanted to achieve something like this..
input_text = "The body is burnt"
output = "en-uk"
input_text = "The body is burned"
output = "en-us"
Try TextBlob. This requires the NLTK package and uses Google's translation service, so it needs an internet connection:
from textblob import TextBlob
b = TextBlob("bonjour")
b.detect_language()
Note that detect_language has been deprecated in recent TextBlob versions, and it identifies the language (e.g. "fr"), not the dialect, so it may not distinguish en-uk from en-us.
Similar to this answer, you could use the word lists from the American-British-English-Translator project.
import re
import requests
url = "https://raw.githubusercontent.com/hyperreality/American-British-English-Translator/master/data/"
# The two dictionaries differ slightly so we import both
uk_to_us = requests.get(url + "british_spellings.json").json()
us_to_uk = requests.get(url + "american_spellings.json").json()
us_only = requests.get(url + "american_only.json").json()
uk_only = requests.get(url + "british_only.json").json()
# Save these word lists in a local text file if you want to avoid requesting the data every time
uk_words = set(uk_to_us) | set(uk_only)
us_words = set(us_to_uk) | set(us_only)
uk_phrases = {w for w in uk_words if len(w.split()) > 1}
us_phrases = {w for w in us_words if len(w.split()) > 1}
uk_words -= uk_phrases
us_words -= us_phrases
max_length = max(len(word.split()) for word in uk_phrases | us_phrases)
def get_dialect(s):
    words = re.findall(r"([a-z]+)", s.lower())  # list of lowercase words only
    uk = 0
    us = 0
    # Check for multi-word phrases first, removing them if they are found
    for length in range(max_length, 1, -1):
        i = 0
        while i + length <= len(words):
            phrase = " ".join(words[i:i+length])
            if phrase in uk_phrases:
                uk += length
                words = words[:i] + words[i + length:]
            elif phrase in us_phrases:
                us += length
                words = words[:i] + words[i + length:]
            else:
                i += 1
    # Add single words
    uk += sum(word in uk_words for word in words)
    us += sum(word in us_words for word in words)
    print("Scores", uk, us)
    if uk > us:
        return "en-uk"
    if us > uk:
        return "en-us"
    return "Unknown"
print(get_dialect("The color of the ax")) # en-us
print(get_dialect("The colour of the axe")) # en-uk
print(get_dialect("I opened my brolly on the zebra crossing")) # en-uk
print(get_dialect("The body is burnt")) # Unknown
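As the comment in the setup code suggests, you can cache the word lists locally so they are only downloaded once. Here is a minimal sketch (the wordlists directory and the load_wordlist helper are names I made up; I use the standard library's urllib here to avoid the extra requests dependency, but requests works the same way):

```python
import json
import os
from urllib.request import urlopen

URL = "https://raw.githubusercontent.com/hyperreality/American-British-English-Translator/master/data/"

def load_wordlist(name, cache_dir="wordlists"):
    """Return the JSON word list `name`, downloading it only on first use."""
    os.makedirs(cache_dir, exist_ok=True)
    path = os.path.join(cache_dir, name)
    if not os.path.exists(path):
        # First run: fetch the file and store it beside the script
        with urlopen(URL + name) as resp:
            with open(path, "w", encoding="utf-8") as f:
                f.write(resp.read().decode("utf-8"))
    with open(path, encoding="utf-8") as f:
        return json.load(f)

# e.g. uk_to_us = load_wordlist("british_spellings.json")
```

Subsequent runs read from disk instead of hitting GitHub on every import.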
This simply tests at the individual word level and cannot check for differences in how words are used in grammatical context (e.g. a word used only as an adjective in one dialect may also be a past-tense verb in the other). Also, the us_only and uk_only lists do not contain inflected forms of the same word (e.g. "abseil" is there but not "abseiled", "abseiling", etc.), so you would ideally convert your text to stems first.
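To illustrate the stemming idea, here is a deliberately crude suffix stripper (crude_stem is a made-up name; a real stemmer such as NLTK's PorterStemmer handles far more cases and should be preferred):

```python
def crude_stem(word):
    """Strip a few common suffixes so inflected forms match the base
    entries in the word lists, e.g. 'abseiled' -> 'abseil'.
    This is a rough sketch, not a proper stemming algorithm."""
    for suffix in ("ing", "ed", "s"):
        # Require a few leftover characters so short words survive intact
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

# Applied before the dictionary lookups in get_dialect:
# words = [crude_stem(w) for w in words]
```

Even this rough version lets "abseiled" and "abseiling" match the "abseil" entry, though it will mangle words like "sing" less gracefully than a real stemmer would.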