
I am working on a project where the PDFs contain text in both English and Spanish. I am only interested in the English part, which I want to save to a database. I am using Apache PDFBox to extract the text. How can I skip the Spanish content and keep only the English text? I tried libraries such as Apache Tika and https://code.google.com/p/language-detection/, but they give incorrect results in some cases. Can anyone suggest a reliable solution or another way to achieve this? Thanks in advance.

Sunny Gupta
  • Welcome to SO. Recommendations of tools or software are off topic for this site. – Fildor Aug 11 '15 at 07:07
  • Assuming the characters are Unicode-encoded, I would extract all the words whose letters fall in the ranges `0x0061 - 0x007a` (a-z) and `0x0041 - 0x005a` (A-Z). http://unicode.org/charts/PDF/U0000.pdf – chathux Aug 11 '15 at 07:12
  • You can split the text into paragraphs and, if a paragraph contains any Spanish letter or accent (`á, é, í, ó, ú, ü, ñ, ¿, ¡`), count it as Spanish. – user1516873 Aug 11 '15 at 07:13
  • @user1516873 That would kick out any English paragraph containing a Spanish name (one that contains a Spanish letter) ... – Fildor Aug 11 '15 at 07:14
  • Possible duplicate: http://stackoverflow.com/questions/3227524/how-to-detect-language-of-user-entered-text?rq=1 – Fildor Aug 11 '15 at 07:17
  • In that case you need to maintain a dictionary of Spanish marker words, like `y, el, la`. But sure, you can use a dedicated library too. – user1516873 Aug 11 '15 at 07:25
  • @SunnyGupta You say you already have a solution, but it is incorrect in some cases. Does it report Spanish for English, English for Spanish, or both? In the first case, I'd just save them all and provide a "This is not English" button or some other means to manually "clean" the database. Not perfect, but I doubt you will get 100%. – Fildor Aug 11 '15 at 07:31
  • @Fildor I tried Apache Tika on just the text "Please god same me" and it gives es (Spanish) as the result. – Sunny Gupta Aug 12 '15 at 06:18
  • Maybe you should try other engines as well. Personally, I would run at least two and compare results. Does the recognition have some "confidence" level? – Fildor Aug 12 '15 at 11:28
  • At least I see it has `isReasonablyCertain()` ... Do you use it? – Fildor Aug 12 '15 at 11:31
  • yeah... I think... i can use it... – Sunny Gupta Aug 13 '15 at 11:08
  • https://code.google.com/p/language-detection/ seems to be a good choice. It is working fine in my case. What do you think? – Sunny Gupta Aug 13 '15 at 11:10
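
The paragraph-splitting and marker-word ideas from the comments above could be combined roughly as follows. This is only a minimal sketch under stated assumptions: the class `ParagraphLanguageFilter`, its method names, and the two short stopword lists are hypothetical illustrations, not part of PDFBox, Tika, or language-detection, and the input is assumed to be plain text already extracted from the PDF with paragraphs separated by blank lines. Counting common function words instead of accented letters means an English paragraph that mentions a Spanish name is not rejected for that reason alone.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Hypothetical helper for illustration; not part of PDFBox or Tika.
class ParagraphLanguageFilter {

    // Tiny illustrative marker-word lists (assumption); real lists would be far longer.
    private static final Set<String> ENGLISH = new HashSet<>(Arrays.asList(
            "the", "and", "of", "to", "is", "in", "that", "it",
            "for", "with", "this", "are", "was", "on"));
    private static final Set<String> SPANISH = new HashSet<>(Arrays.asList(
            "el", "la", "los", "las", "y", "de", "que", "en",
            "es", "un", "una", "por", "con", "para", "del"));

    // Counts how many tokens of the paragraph appear in the given marker-word set.
    static int countHits(String paragraph, Set<String> markers) {
        int hits = 0;
        // Split on anything that is not a Unicode letter, so accented words stay whole.
        for (String token : paragraph.toLowerCase(Locale.ROOT).split("[^\\p{L}]+")) {
            if (markers.contains(token)) {
                hits++;
            }
        }
        return hits;
    }

    // A paragraph counts as English only if English markers strictly outnumber Spanish ones.
    // Ties (including paragraphs with no markers at all) are treated as "not English".
    static boolean looksEnglish(String paragraph) {
        return countHits(paragraph, ENGLISH) > countHits(paragraph, SPANISH);
    }

    // Splits the text on blank lines and keeps only paragraphs that look English.
    static String keepEnglish(String text) {
        List<String> kept = new ArrayList<>();
        for (String paragraph : text.split("\\R\\s*\\R")) {
            if (looksEnglish(paragraph)) {
                kept.add(paragraph.trim());
            }
        }
        return String.join("\n\n", kept);
    }
}
```

Very short fragments contain few or no marker words, so this heuristic rejects them as ties rather than guessing; such cases are probably better left to a statistical detector or to manual review, as suggested in the comments.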

0 Answers