Web page language detection based on content

Question

I am working on web page language detection. I can already retrieve a page's content through middleware I developed, since there is no standard location for the content. However, I don't know how to detect the language. I tried the lang and xml:lang attributes, but they are less reliable than I expected: I have seen websites whose actual language differs from the one declared there. Any help would be appreciated. (Environment: Java, Eclipse.)

score 1 · Accepted Answer · edited May 23 '17 at 12:23

This is a classic problem in NLP, and existing solutions give pretty good predictions. This post looks similar to this one: link, which has some good answers. I'm not familiar with the solutions mentioned there, but I have used Apache Tika for another purpose, and it's a great open-source library. Hope that helps.
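To illustrate what content-based detection means, here is a minimal sketch in plain Java that guesses a language by counting stopword hits. It is not Apache Tika's API or the method of any library mentioned in the linked post; the class name StopwordLanguageGuesser, the guess method, and the tiny word lists are all hypothetical and only for illustration. It assumes the text has already been extracted from the page as a plain string.

```java
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

// Hypothetical helper, not part of any library: guesses a language by counting
// how many tokens of the text appear in a small per-language stopword list.
final class StopwordLanguageGuesser {

    // Illustrative stopword lists only; a real detector needs far larger profiles.
    private static final Map<String, Set<String>> STOPWORDS = new LinkedHashMap<>();
    static {
        STOPWORDS.put("en", Set.of("the", "and", "is", "of", "to", "in", "that", "it", "with", "for"));
        STOPWORDS.put("fr", Set.of("le", "la", "les", "et", "est", "des", "un", "une", "dans", "pour"));
        STOPWORDS.put("de", Set.of("der", "die", "das", "und", "ist", "nicht", "ein", "eine", "mit", "für"));
        STOPWORDS.put("es", Set.of("el", "los", "las", "y", "es", "una", "con", "para", "por", "que"));
    }

    // Returns the language code with the most stopword hits, or "unknown" if none match.
    static String guess(String text) {
        // Split on anything that is not a letter; \p{L} keeps accented characters.
        String[] tokens = text.toLowerCase(Locale.ROOT).split("[^\\p{L}]+");
        String best = "unknown";
        int bestHits = 0;
        for (Map.Entry<String, Set<String>> entry : STOPWORDS.entrySet()) {
            int hits = 0;
            for (String token : tokens) {
                if (entry.getValue().contains(token)) {
                    hits++;
                }
            }
            // Strictly greater: on a tie, the language listed first wins.
            if (hits > bestHits) {
                bestHits = hits;
                best = entry.getKey();
            }
        }
        return best;
    }
}
```

Calling StopwordLanguageGuesser.guess("Le chat est dans le jardin") returns "fr" with these lists. The approach is deterministic but crude: it needs a reasonable amount of text and will confuse closely related languages, which is why a maintained library with proper language profiles is usually the better choice in practice.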

edited May 23 '17 at 12:23 by Community

answered Apr 25 '16 at 15:45 by lazary

I am working in Java and I already checked the link, but the library they referred to (specifically language detector) has accuracy issues: it gives different results for the same text and supports a limited number of languages. – noble_man Apr 26 '16 at 06:18
