2

I am working on web page language detection, and I have managed to retrieve the content of a page through other middleware I developed, since there is no standard for where the content is located. However, I don't know how to detect the language. I tried using the lang and xml:lang attributes, but they are not as reliable as I expected: I have seen websites whose actual language differs from the one specified in the attribute. Any help would be appreciated. (Environment: Java, Eclipse)

noble_man
  • 352
  • 3
  • 20

1 Answer

1

This is a classic problem in NLP, and existing solutions give pretty good predictions. This post looks similar to this one: link, which has some good answers. I'm not familiar with the solutions mentioned there, but I have used Apache Tika for another purpose, and it's a great open-source library. Hope that helps.
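To make the idea of detecting language from the text itself (rather than from the lang attribute) concrete, here is a minimal sketch in plain Java. It is not how Apache Tika or any other library mentioned here works; it is a toy stop-word counter, and the class name `StopWordLanguageGuesser`, the `guess` method, and the tiny word lists are all hypothetical names made up for illustration. A real detector would typically use character n-gram statistics trained on large corpora.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

public class StopWordLanguageGuesser {

    // Tiny hand-picked stop-word lists: an assumption for illustration only,
    // far too small for real-world accuracy.
    private static final Map<String, Set<String>> STOP_WORDS = new LinkedHashMap<>();

    static {
        STOP_WORDS.put("en", words("the", "and", "is", "of", "to", "in", "that", "it", "with", "for"));
        STOP_WORDS.put("fr", words("le", "la", "les", "et", "est", "des", "une", "dans", "que", "pour"));
        STOP_WORDS.put("de", words("der", "die", "das", "und", "ist", "nicht", "ein", "mit", "für", "auf"));
        STOP_WORDS.put("es", words("el", "los", "las", "es", "una", "por", "con", "para", "que", "del"));
    }

    private static Set<String> words(String... w) {
        return new HashSet<>(Arrays.asList(w));
    }

    /**
     * Returns the language code whose stop words occur most often in the text,
     * or "unknown" if no stop word from any list is found.
     */
    public static String guess(String text) {
        // Lower-case and split on anything that is not a Unicode letter.
        String[] tokens = text.toLowerCase(Locale.ROOT).split("[^\\p{L}]+");

        String best = "unknown";
        int bestScore = 0;
        for (Map.Entry<String, Set<String>> entry : STOP_WORDS.entrySet()) {
            int score = 0;
            for (String token : tokens) {
                if (entry.getValue().contains(token)) {
                    score++;
                }
            }
            // Strictly greater: on a tie, the language listed first wins.
            if (score > bestScore) {
                bestScore = score;
                best = entry.getKey();
            }
        }
        return best;
    }

    public static void main(String[] args) {
        System.out.println(guess("The cat is in the garden and it plays with the ball"));
        System.out.println(guess("Le chat est dans le jardin et la maison"));
    }
}
```

Because the scoring is a plain count with no randomness, the same text always gets the same answer, but short or mixed-language pages will easily fool a list this small.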

Community
  • 1
  • 1
lazary
  • 449
  • 1
  • 8
  • 17
  • I am working in Java and I already checked the link, but the library they refer to (specifically the language detector) has accuracy issues: it gives different results for the same text and supports a limited set of languages. – noble_man Apr 26 '16 at 06:18