I am working on web page language detection. I can already retrieve the content of a page through other middleware I developed, which I needed because there is no standard location for the content. However, I don't know how to detect the language. I tried using the lang and xml:lang attributes, but they are not as reliable as I expected: I have seen some websites whose content is in a different language from the one specified in the attribute. Any help would be appreciated. (Environment: Java, Eclipse)
Viewed 816 times
1 Answer
This is a classic problem in NLP, and existing approaches give pretty good predictions. This post looks similar to this one: link, which has some good answers. I'm not familiar with the solutions mentioned there, but I did use Apache Tika for another matter and it's a great open-source library. Hope that helps.
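To give an idea of what such detectors typically do, here is a minimal sketch of the character n-gram approach: count the trigrams in the page text and compare them, by cosine similarity, with a trigram profile for each candidate language. This is not how Apache Tika or any particular library is implemented. The class name `NGramLanguageGuesser`, its methods, and the three tiny training sentences are all made up for illustration, and a usable detector would need much larger training corpora and many more languages.

```java
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;

/** Toy language guesser based on character trigram counts (illustrative only). */
class NGramLanguageGuesser {

    // Hypothetical, tiny training samples; real profiles are built from large corpora.
    private static final Map<String, String> SAMPLES = new LinkedHashMap<>();
    static {
        SAMPLES.put("en", "the quick brown fox jumps over the lazy dog and this is the "
                + "language that people speak in england and the united states");
        SAMPLES.put("de", "der schnelle braune fuchs springt und dies ist die sprache "
                + "die man in deutschland spricht und schreibt");
        SAMPLES.put("fr", "le renard brun rapide saute sur le chien et ceci est la langue "
                + "que les gens parlent en france et en belgique");
    }

    // One trigram profile per language, computed once from the samples above.
    private static final Map<String, Map<String, Integer>> PROFILES = new LinkedHashMap<>();
    static {
        for (Map.Entry<String, String> e : SAMPLES.entrySet()) {
            PROFILES.put(e.getKey(), trigrams(e.getValue()));
        }
    }

    /** Counts overlapping character trigrams after lower-casing and turning non-letters into spaces. */
    static Map<String, Integer> trigrams(String text) {
        String normalized = " "
                + text.toLowerCase(Locale.ROOT).replaceAll("[^\\p{L}]+", " ").trim()
                + " ";
        Map<String, Integer> counts = new HashMap<>();
        for (int i = 0; i + 3 <= normalized.length(); i++) {
            counts.merge(normalized.substring(i, i + 3), 1, Integer::sum);
        }
        return counts;
    }

    /** Cosine similarity between two trigram count maps; 0 if either map is empty. */
    static double cosine(Map<String, Integer> a, Map<String, Integer> b) {
        double dot = 0, normA = 0, normB = 0;
        for (Map.Entry<String, Integer> e : a.entrySet()) {
            dot += e.getValue() * b.getOrDefault(e.getKey(), 0);
            normA += e.getValue() * e.getValue();
        }
        for (int v : b.values()) {
            normB += v * v;
        }
        if (normA == 0 || normB == 0) {
            return 0.0;
        }
        return dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }

    /** Returns the code of the best-matching profile, or "unknown" if nothing overlaps. */
    static String guess(String text) {
        Map<String, Integer> profile = trigrams(text);
        String best = "unknown";
        double bestScore = 0.0;
        for (Map.Entry<String, Map<String, Integer>> e : PROFILES.entrySet()) {
            double score = cosine(profile, e.getValue());
            if (score > bestScore) {
                bestScore = score;
                best = e.getKey();
            }
        }
        return best;
    }

    public static void main(String[] args) {
        System.out.println(guess("this is the language that people speak"));
    }
}
```

Because nothing here is randomized, the same text always gets the same answer. Very short or mixed-language pages will still be guessed poorly, so it is probably worth running the detection on the main body text rather than on menus or footers.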
I am working in Java and I already checked the link, but the library they referred to (specifically the language detector) has accuracy issues: it gives different results for the same text and supports only a limited number of languages. – noble_man Apr 26 '16 at 06:18