Language detection using pycld2

Question

I am trying to use the pycld2 package to detect multiple languages in text. This is the example I am testing out:

import pycld2 as cld2

text = '''The universal connection with an additional advantage: Push-in connection. Terminate solid and stranded (Class B 7 strands or less), as well as ferruled conductors, by simply pushing them in – no tools required. La connessione universale con un ulteriore vantaggio: Connessione push-in. Terminare solido e incagliato (trefoli di classe B 7 o meno), così come i conduttori a puntale, semplicemente spingendoli in – nessun attrezzo richiesto. Der universelle Anschluss mit zusätzlichem Vorteil: Push-in-Anschluss Vollständig und verseilt abschließen (Klasse B 7 Stränge oder weniger), sowie Aderendhülsen durch einfaches Aufschieben in – kein Werkzeug erforderlich.'''

reliable, index, top_3_choices, vecs = cld2.detect(text, returnVectors=True)

The top 3 detected languages are the following:

print(top_3_choices)
(('GERMAN', 'de', 34, 1089.0), ('ITALIAN', 'it', 33, 355.0), ('ENGLISH', 'en', 32, 953.0))

According to the documentation, the fourth element of each tuple is the confidence score, and the third element is the percentage of the original text detected in that language. However, I am struggling to interpret the score so that I can flag how confident the detection is. Can I somehow normalize the score to get interpretable probabilities?
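
To make the tuple layout described above concrete, here is a minimal sketch that names the four fields and flags languages covering only a small share of the text. It assumes the (name, code, percent, score) layout stated in the question; the `Detection` named tuple, the `low_share` helper, and the 20 percent threshold are hypothetical names and values chosen for illustration, not part of pycld2.

```python
from typing import NamedTuple


class Detection(NamedTuple):
    """One entry of the details tuple (hypothetical wrapper, not a pycld2 type)."""
    name: str
    code: str
    percent: int   # share of the text attributed to this language
    score: float   # raw score as reported; its scale is not interpreted here


# Output copied from the question, so the sketch runs without pycld2 installed.
top_3_choices = (
    ('GERMAN', 'de', 34, 1089.0),
    ('ITALIAN', 'it', 33, 355.0),
    ('ENGLISH', 'en', 32, 953.0),
)

detections = [Detection(*choice) for choice in top_3_choices]


def low_share(items, min_percent=20):
    """Return codes of languages covering less than min_percent of the text."""
    return [d.code for d in items if d.percent < min_percent]


for d in detections:
    print(f"{d.code}: {d.percent}% of text, raw score {d.score}")
```

With the output from the question, `low_share(detections)` returns an empty list, because every language covers more than 20 percent of the text.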

Why don't you sum all the scores and then divide each individual score by that sum to get its probability? - bill, May 05 '22 at 16:45
As I understand it, the fourth argument gives all the languages that have been detected (without a score), so other languages could be present, and I guess computing probabilities this way won't be correct. - natt010, May 05 '22 at 17:17
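
The normalization suggested in the first comment can be sketched as follows. This is only a heuristic: the resulting values are shares relative to the languages that were reported, not calibrated probabilities, which is the concern raised in the second comment. The `score_shares` helper is a hypothetical name, and skipping zero-score entries rests on the assumption that unused slots may be reported that way.

```python
# Output copied from the question, so the sketch runs without pycld2 installed.
top_3_choices = (
    ('GERMAN', 'de', 34, 1089.0),
    ('ITALIAN', 'it', 33, 355.0),
    ('ENGLISH', 'en', 32, 953.0),
)


def score_shares(choices):
    """Divide each raw score by the sum of the reported scores.

    Entries with a non-positive score are skipped (assumed to be empty slots).
    Returns a dict mapping language code to its share of the total score.
    """
    scored = [(code, score) for _name, code, _percent, score in choices if score > 0]
    total = sum(score for _code, score in scored)
    if total == 0:
        return {}
    return {code: score / total for code, score in scored}


shares = score_shares(top_3_choices)
for code, share in shares.items():
    print(f"{code}: {share:.2f}")
```

The shares sum to 1 by construction, so a language missing from the top three would silently inflate the values of the ones that are listed.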
