A few days ago I started developing a Java server that stores a bunch of text and identifies its language, and I decided to use LingPipe for that task. However, I have run into an issue: after training a classifier and evaluating it with two languages (English and Spanish), I cannot get Spanish text identified correctly, while English and French text give successful results.
The tutorial I followed to complete this task is: http://alias-i.com/lingpipe/demos/tutorial/langid/read-me.html
These are the steps I followed to train the language classifier:
~1. First, place and unpack the English and Spanish corpora inside a folder named leipzig, as follows (Note: the metadata and sentences are provided by http://wortschatz.uni-leipzig.de/en/download):
leipzig //Main folder
    1M sentences //Folder with the data from the last attempt
        eng_news_2015_1M
        eng_news_2015_1M.tar.gz
        spa-hn_web_2015_1M
        spa-hn_web_2015_1M.tar.gz
    ClassifyLang.java //Custom program to try the trained classifier
    dist //Folder
        eng_news_2015_300K.tar.gz //compressed English sentences
        spa-hn_web_2015_300K.tar.gz //compressed Spanish sentences
    EvalLanguageId.java
    langid-leipzig.classifier //compiled (trained) classifier
    lingpipe-4.1.2.jar
    munged //Folder
        eng //folder containing the sentences.txt for English
            sentences.txt
        spa //folder containing the sentences.txt for Spanish
            sentences.txt
    Munge.java
    TrainLanguageId.java
    unpacked //Folder
        eng_news_2015_300K //Folder with the English metadata
            eng_news_2015_300K-co_n.txt
            eng_news_2015_300K-co_s.txt
            eng_news_2015_300K-import.sql
            eng_news_2015_300K-inv_so.txt
            eng_news_2015_300K-inv_w.txt
            eng_news_2015_300K-sources.txt
            eng_news_2015_300K-words.txt
            sentences.txt
        spa-hn_web_2015_300K //Folder with the Spanish metadata
            sentences.txt
            spa-hn_web_2015_300K-co_n.txt
            spa-hn_web_2015_300K-co_s.txt
            spa-hn_web_2015_300K-import.sql
            spa-hn_web_2015_300K-inv_so.txt
            spa-hn_web_2015_300K-inv_w.txt
            spa-hn_web_2015_300K-sources.txt
            spa-hn_web_2015_300K-words.txt
~2. Second, unpack the compressed language metadata into an unpacked folder:
unpacked //Folder
    eng_news_2015_300K //Folder with the English metadata
        eng_news_2015_300K-co_n.txt
        eng_news_2015_300K-co_s.txt
        eng_news_2015_300K-import.sql
        eng_news_2015_300K-inv_so.txt
        eng_news_2015_300K-inv_w.txt
        eng_news_2015_300K-sources.txt
        eng_news_2015_300K-words.txt
        sentences.txt
    spa-hn_web_2015_300K //Folder with the Spanish metadata
        sentences.txt
        spa-hn_web_2015_300K-co_n.txt
        spa-hn_web_2015_300K-co_s.txt
        spa-hn_web_2015_300K-import.sql
        spa-hn_web_2015_300K-inv_so.txt
        spa-hn_web_2015_300K-inv_w.txt
        spa-hn_web_2015_300K-sources.txt
        spa-hn_web_2015_300K-words.txt
~3. Then munge the sentences of each language to remove the line numbers and tabs and to replace line breaks with single space characters. The output is written uniformly in the UTF-8 Unicode encoding (Note: the Munge.java from the LingPipe site; a rough sketch of what it does is shown after the folder listing below).
/-----------------Command line----------------------------------------------/
javac -cp lingpipe-4.1.2.jar: Munge.java
java -cp lingpipe-4.1.2.jar: Munge /home/samuel/leipzig/unpacked /home/samuel/leipzig/munged
----------------------------------------Results-----------------------------
spa
reading from=/home/samuel/leipzig/unpacked/spa-hn_web_2015_300K/sentences.txt charset=iso-8859-1
writing to=/home/samuel/leipzig/munged/spa/spa.txt charset=utf-8
total length=43267166
eng
reading from=/home/samuel/leipzig/unpacked/eng_news_2015_300K/sentences.txt charset=iso-8859-1
writing to=/home/samuel/leipzig/munged/eng/eng.txt charset=utf-8
total length=35847257
/---------------------------------------------------------------/
<---------------------------------Folder------------------------------------->
munged //Folder
    eng //folder containing the sentences.txt for English
        sentences.txt
    spa //folder containing the sentences.txt for Spanish
        sentences.txt
<-------------------------------------------------------------------------->
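Roughly, the munging boils down to something like this minimal sketch (not the actual Munge.java from the tutorial, and the class name and arguments are just illustrative; it assumes every line of sentences.txt starts with a line number followed by a tab):

import java.io.BufferedReader;
import java.io.File;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.OutputStreamWriter;
import java.io.Writer;

// Minimal munging sketch: strips the leading "number<tab>" from every line of
// sentences.txt (read as ISO-8859-1) and writes the sentences as one long
// UTF-8 text, replacing line breaks with single spaces.
public class MungeSketch {
    public static void main(String[] args) throws IOException {
        File in = new File(args[0]);   // e.g. unpacked/eng_news_2015_300K/sentences.txt
        File out = new File(args[1]);  // e.g. munged/eng/sentences.txt
        try (BufferedReader reader = new BufferedReader(
                 new InputStreamReader(new FileInputStream(in), "ISO-8859-1"));
             Writer writer = new OutputStreamWriter(new FileOutputStream(out), "UTF-8")) {
            String line;
            while ((line = reader.readLine()) != null) {
                int tab = line.indexOf('\t');
                String sentence = (tab >= 0) ? line.substring(tab + 1) : line;
                writer.write(sentence);
                writer.write(' ');     // line breaks become single spaces
            }
        }
    }
}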
~4. Next we train the language classifier (Note: the TrainLanguageId.java from the LingPipe LanguageId tutorial; a rough sketch of this step follows the results below).
/---------------Command line--------------------------------------------/
javac -cp lingpipe-4.1.2.jar: TrainLanguageId.java
java -cp lingpipe-4.1.2.jar: TrainLanguageId /home/samuel/leipzig/munged /home/samuel/leipzig/langid-leipzig.classifier 100000 5
-----------------------------------Results-----------------------------------
nGram=100000 numChars=5
Training category=eng
Training category=spa
Compiling model to file=/home/samuel/leipzig/langid-leipzig.classifier
/----------------------------------------------------------------------------/
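For context, the core of this training step boils down to something like the sketch below: one character n-gram language model is trained per category (one category per subfolder of munged), and the classifier is compiled to disk. This is not the tutorial's exact TrainLanguageId.java; the class name, the n-gram length of 5, and the <lang>.txt file names (which follow the munge log above) are my assumptions.

import com.aliasi.classify.Classification;
import com.aliasi.classify.Classified;
import com.aliasi.classify.DynamicLMClassifier;
import com.aliasi.lm.NGramProcessLM;
import com.aliasi.util.AbstractExternalizable;
import com.aliasi.util.Files;

import java.io.File;
import java.io.IOException;

// Training sketch: one n-gram character language model per language folder,
// compiled into a single classifier file.
public class TrainSketch {
    public static void main(String[] args) throws IOException {
        File dataDir = new File("/home/samuel/leipzig/munged");
        File modelFile = new File("/home/samuel/leipzig/langid-leipzig.classifier");
        int nGram = 5; // n-gram length (assumption)

        String[] categories = dataDir.list(); // e.g. {"eng", "spa"}
        DynamicLMClassifier<NGramProcessLM> classifier
            = DynamicLMClassifier.createNGramProcess(categories, nGram);

        for (String category : categories) {
            // munged/<lang>/<lang>.txt, as written by the munge step above
            File trainingFile = new File(new File(dataDir, category), category + ".txt");
            String text = Files.readFromFile(trainingFile, "UTF-8");
            Classified<CharSequence> classified
                = new Classified<CharSequence>(text, new Classification(category));
            classifier.handle(classified); // train this category's language model
        }

        // Write the compiled classifier to disk.
        AbstractExternalizable.compileTo(classifier, modelFile);
    }
}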
~5. We evaluated the trained classifier and got the following results, which show a problem in the confusion matrix (Note: the EvalLanguageId.java from the LingPipe LanguageId tutorial; a rough sketch of the evaluation step follows the results below).
/------------------------Command line---------------------------------/
javac -cp lingpipe-4.1.2.jar: EvalLanguageId.java
java -cp lingpipe-4.1.2.jar: EvalLanguageId /home/samuel/leipzig/munged /home/samuel/leipzig/langid-leipzig.classifier 100000 50 1000
-------------------------------Results-------------------------------------
Reading classifier from file=/home/samuel/leipzig/langid-leipzig.classifier
Evaluating category=eng
Evaluating category=spa
TEST RESULTS
BASE CLASSIFIER EVALUATION
Categories=[eng, spa]
Total Count=2000
Total Correct=1000
Total Accuracy=0.5
95% Confidence Interval=0.5 +/- 0.02191346617949794
Confusion Matrix
reference \ response
,eng,spa
eng,1000,0 <---------- not diagonal: every test case ends up in the eng column
spa,1000,0
Macro-averaged Precision=NaN
Macro-averaged Recall=0.5
Macro-averaged F=NaN
Micro-averaged Results
the following symmetries are expected:
TP=TN, FN=FP
PosRef=PosResp=NegRef=NegResp
Acc=Prec=Rec=F
Total=4000
True Positive=1000
False Negative=1000
False Positive=1000
True Negative=1000
Positive Reference=2000
Positive Response=2000
Negative Reference=2000
Negative Response=2000
Accuracy=0.5
Recall=0.5
Precision=0.5
Rejection Recall=0.5
Rejection Precision=0.5
F(1)=0.5
Fowlkes-Mallows=2000.0
Jaccard Coefficient=0.3333333333333333
Yule's Q=0.0
Yule's Y=0.0
Reference Likelihood=0.5
Response Likelihood=0.5
Random Accuracy=0.5
Random Accuracy Unbiased=0.5
kappa=0.0
kappa Unbiased=0.0
kappa No Prevalence=0.0
chi Squared=0.0
phi Squared=0.0
Accuracy Deviation=0.007905694150420948
Random Accuracy=0.5
Random Accuracy Unbiased=0.625
kappa=0.0
kappa Unbiased=-0.3333333333333333
kappa No Prevalence =0.0
Reference Entropy=1.0
Response Entropy=NaN
Cross Entropy=Infinity
Joint Entropy=1.0
Conditional Entropy=0.0
Mutual Information=0.0
Kullback-Liebler Divergence=Infinity
chi Squared=NaN
chi-Squared Degrees of Freedom=1
phi Squared=NaN
Cramer's V=NaN
lambda A=0.0
lambda B=NaN
ONE VERSUS ALL EVALUATIONS BY CATEGORY
CATEGORY[0]=eng VERSUS ALL
First-Best Precision/Recall Evaluation
Total=2000
True Positive=1000
False Negative=0
False Positive=1000
True Negative=0
Positive Reference=1000
Positive Response=2000
Negative Reference=1000
Negative Response=0
Accuracy=0.5
Recall=1.0
Precision=0.5
Rejection Recall=0.0
Rejection Precision=NaN
F(1)=0.6666666666666666
Fowlkes-Mallows=1414.2135623730949
Jaccard Coefficient=0.5
Yule's Q=NaN
Yule's Y=NaN
Reference Likelihood=0.5
Response Likelihood=1.0
Random Accuracy=0.5
Random Accuracy Unbiased=0.625
kappa=0.0
kappa Unbiased=-0.3333333333333333
kappa No Prevalence=0.0
chi Squared=NaN
phi Squared=NaN
Accuracy Deviation=0.011180339887498949
CATEGORY[1]=spa VERSUS ALL
First-Best Precision/Recall Evaluation
Total=2000
True Positive=0
False Negative=1000
False Positive=0
True Negative=1000
Positive Reference=1000
Positive Response=0
Negative Reference=1000
Negative Response=2000
Accuracy=0.5
Recall=0.0
Precision=NaN
Rejection Recall=1.0
Rejection Precision=0.5
F(1)=NaN
Fowlkes-Mallows=NaN
Jaccard Coefficient=0.0
Yule's Q=NaN
Yule's Y=NaN
Reference Likelihood=0.5
Response Likelihood=0.0
Random Accuracy=0.5
Random Accuracy Unbiased=0.625
kappa=0.0
kappa Unbiased=-0.3333333333333333
kappa No Prevalence=0.0
chi Squared=NaN
phi Squared=NaN
Accuracy Deviation=0.011180339887498949
/-----------------------------------------------------------------------/
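For context, the evaluation boils down to something like the sketch below: fixed-length slices of each language's munged text are classified against the compiled model, and the results are accumulated into the confusion matrix shown above. This is not the tutorial's exact EvalLanguageId.java; the class name, the 50-character slices, the 1000 test cases per language, and the <lang>.txt file names are my assumptions.

import com.aliasi.classify.BaseClassifierEvaluator;
import com.aliasi.classify.Classification;
import com.aliasi.classify.Classified;
import com.aliasi.classify.LMClassifier;
import com.aliasi.util.AbstractExternalizable;
import com.aliasi.util.Files;

import java.io.File;
import java.io.IOException;

// Evaluation sketch: classify fixed-length character slices of each language's
// munged text and accumulate accuracy and a confusion matrix.
public class EvalSketch {
    public static void main(String[] args) throws IOException, ClassNotFoundException {
        File dataDir = new File("/home/samuel/leipzig/munged");
        File modelFile = new File("/home/samuel/leipzig/langid-leipzig.classifier");
        int testSize = 50;   // characters per test case (assumption)
        int numTests = 1000; // test cases per language (assumption)

        // Read the compiled classifier (raw type, as in the tutorial code).
        LMClassifier classifier
            = (LMClassifier) AbstractExternalizable.readObject(modelFile);

        String[] categories = dataDir.list();
        boolean storeInputs = false;
        BaseClassifierEvaluator<CharSequence> evaluator
            = new BaseClassifierEvaluator<CharSequence>(classifier, categories, storeInputs);

        for (String category : categories) {
            File testFile = new File(new File(dataDir, category), category + ".txt");
            String text = Files.readFromFile(testFile, "UTF-8");
            for (int i = 0; i < numTests; ++i) {
                String testCase = text.substring(i * testSize, (i + 1) * testSize);
                Classified<CharSequence> classified
                    = new Classified<CharSequence>(testCase, new Classification(category));
                evaluator.handle(classified); // classify and score against the reference label
            }
        }
        System.out.println(evaluator); // prints accuracy, confusion matrix, etc.
    }
}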
~6. Then we tried a real test with Spanish text:
/-------------------Command line----------------------------------/
javac -cp lingpipe-4.1.2.jar: ClassifyLang.java
java -cp lingpipe-4.1.2.jar: ClassifyLang
/-------------------------------------------------------------------------/
<---------------------------------Result------------------------------------>
Text: Yo soy una persona increíble y muy inteligente, me admiro a mi mismo lo que me hace sentir ansiedad de lo que viene, por que es algo grandioso lleno de cosas buenas y de ahora en adelante estaré enfocado y optimista aunque tengo que aclarar que no lo haré por querer algo, sino por que es mi pasión.
Best Language: eng <------------- Wrong Result
<----------------------------------------------------------------------->
Code for ClassifyLang.java:
import com.aliasi.classify.Classification;
import com.aliasi.classify.LMClassifier;
import com.aliasi.util.AbstractExternalizable;

import java.io.File;
import java.io.IOException;

public class ClassifyLang {

    public static String text = "Yo soy una persona increíble y muy inteligente, me admiro a mi mismo"
            + " estoy ansioso de lo que viene, por que es algo grandioso lleno de cosas buenas"
            + " y de ahora en adelante estaré enfocado y optimista"
            + " aunque tengo que aclarar que no lo haré por querer algo, sino por que no es difícil serlo. ";

    private static final File MODEL_DIR
            = new File("/home/samuel/leipzig/langid-leipzig.classifier");

    public static void main(String[] args) {
        System.out.println("Text: " + text);

        // Read the compiled classifier from disk (raw type, as in the tutorial code).
        LMClassifier classifier;
        try {
            classifier = (LMClassifier) AbstractExternalizable.readObject(MODEL_DIR);
        } catch (IOException | ClassNotFoundException ex) {
            System.out.println("Problem reading the model: " + ex);
            return; // don't continue with a null classifier
        }

        // Classify the text and print the best-scoring category.
        Classification classification = classifier.classify(text);
        System.out.println("Best Language: " + classification.bestCategory());
    }
}
~7. I also tried the 1-million-sentence files and different nGram values, but I got the same results. I will be very thankful for your help.