I wanted to achieve something like this..
input_text = "The body is burnt"
output = "en-uk"
input_text = "The body is burned"
output = "en-us"
Try TextBlob. This requires the NLTK package and uses Google's translation service, so it needs an internet connection:
from textblob import TextBlob
b = TextBlob("bonjour")
b.detect_language()
Note that detect_language has been deprecated in recent TextBlob versions, and it identifies the language (e.g. "fr"), not the dialect, so it may not distinguish en-uk from en-us.
Similar to this answer, you could use the word lists from the American-British-English-Translator project.
import re
import requests
url = "https://raw.githubusercontent.com/hyperreality/American-British-English-Translator/master/data/"
# The two dictionaries differ slightly so we import both
uk_to_us = requests.get(url + "british_spellings.json").json()
us_to_uk = requests.get(url + "american_spellings.json").json()
us_only = requests.get(url + "american_only.json").json()
uk_only = requests.get(url + "british_only.json").json()
# Save these word lists in a local text file if you want to avoid requesting the data every time
uk_words = set(uk_to_us) | set(uk_only)
us_words = set(us_to_uk) | set(us_only)
uk_phrases = {w for w in uk_words if len(w.split()) > 1}
us_phrases = {w for w in us_words if len(w.split()) > 1}
uk_words -= uk_phrases
us_words -= us_phrases
max_length = max(len(word.split()) for word in uk_phrases | us_phrases)
def get_dialect(s):
    words = re.findall(r"([a-z]+)", s.lower())  # list of lowercase words only
    uk = 0
    us = 0
    # Check for multi-word phrases first, removing them if they are found
    for length in range(max_length, 1, -1):
        i = 0
        while i + length <= len(words):
            phrase = " ".join(words[i:i+length])
            if phrase in uk_phrases:
                uk += length
                words = words[:i] + words[i + length:]
            elif phrase in us_phrases:
                us += length
                words = words[:i] + words[i + length:]
            else:
                i += 1
    # Add single words
    uk += sum(word in uk_words for word in words)
    us += sum(word in us_words for word in words)
    print("Scores", uk, us)
    if uk > us:
        return "en-uk"
    if us > uk:
        return "en-us"
    return "Unknown"
print(get_dialect("The color of the ax")) # en-us
print(get_dialect("The colour of the axe")) # en-uk
print(get_dialect("I opened my brolly on the zebra crossing")) # en-uk
print(get_dialect("The body is burnt")) # Unknown
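As the comment in the setup code suggests, you can cache the word lists locally so they are only downloaded once. Here is a minimal sketch (the wordlists directory and the load_wordlist helper are names I made up; I use the standard library's urllib here to avoid the extra requests dependency, but requests works the same way):

```python
import json
import os
from urllib.request import urlopen

URL = "https://raw.githubusercontent.com/hyperreality/American-British-English-Translator/master/data/"

def load_wordlist(name, cache_dir="wordlists"):
    """Return the JSON word list `name`, downloading it only on first use."""
    os.makedirs(cache_dir, exist_ok=True)
    path = os.path.join(cache_dir, name)
    if not os.path.exists(path):
        # First run: fetch the file and store it beside the script
        with urlopen(URL + name) as resp:
            with open(path, "w", encoding="utf-8") as f:
                f.write(resp.read().decode("utf-8"))
    with open(path, encoding="utf-8") as f:
        return json.load(f)

# e.g. uk_to_us = load_wordlist("british_spellings.json")
```

Subsequent runs read from disk instead of hitting GitHub on every import.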
This simply tests at the individual word level and cannot check for differences in how words are used in grammatical context (e.g. a word used only as an adjective in one dialect may also be a past-tense verb in the other). Also, the us_only and uk_only lists do not contain inflected forms of the same word (e.g. "abseil" is there but not "abseiled", "abseiling", etc.), so you would ideally convert your text to stems first.
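To illustrate the stemming idea, here is a deliberately crude suffix stripper (crude_stem is a made-up name; a real stemmer such as NLTK's PorterStemmer handles far more cases and should be preferred):

```python
def crude_stem(word):
    """Strip a few common suffixes so inflected forms match the base
    entries in the word lists, e.g. 'abseiled' -> 'abseil'.
    This is a rough sketch, not a proper stemming algorithm."""
    for suffix in ("ing", "ed", "s"):
        # Require a few leftover characters so short words survive intact
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

# Applied before the dictionary lookups in get_dialect:
# words = [crude_stem(w) for w in words]
```

Even this rough version lets "abseiled" and "abseiling" match the "abseil" entry, though it will mangle words like "sing" less gracefully than a real stemmer would.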