I am using tika-python to extract text from a PDF. When there are multiple tables on a PDF page, the order of the text is not preserved. In my case, the table at the top of the page comes at the end of the text extracted by Tika.

I tried using the following custom config file, but it does not work. I have tried placing the statement <property name="sortByPosition" value="True"/> at various positions, but nothing has worked. I referred to this for the config.xml.

<?xml version="1.0" encoding="UTF-8"?>
<properties>
  <parsers>
    <!-- Default Parser for most things, except for 2 mime types, and never
         use the Executable Parser -->
    <parser class="org.apache.tika.parser.DefaultParser">
      <mime-exclude>image/jpeg</mime-exclude>
      <mime-exclude>application/pdf</mime-exclude>
      <parser-exclude class="org.apache.tika.parser.executable.ExecutableParser"/>
      <!-- property name="sortByPosition" value="True" -->
    </parser>
    <parser class="org.apache.tika.parser.EmptyParser">
      <mime>application/pdf</mime>
      <!-- here? -->
      <!-- this statement is meant to preserve the order -->
      <property name="sortByPosition" value="True"/>
    </parser>
  </parsers>
</properties>

I use the following code to read the text:

from tika import parser

# file_path is the path to the PDF being parsed
data = parser.from_file(file_path, xmlContent=True,
                        config_path='/path/to/tika_config.xml')

What am I doing wrong? How should I change the config, or is preserving the order not possible?
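
For comparison, here is a minimal sketch of a config that attaches the setting to the PDF parser itself rather than to EmptyParser, which as far as I know always produces an empty document and so would not sort anything. It assumes that org.apache.tika.parser.pdf.PDFParser accepts a boolean sortByPosition parameter in the <params>/<param> form shown. That should be checked against the documentation for the Tika version in use. The helper names write_tika_config and extract_sorted are hypothetical and exist only for illustration.

```python
import os
import xml.etree.ElementTree as ET

# Assumption: PDFParser reads "sortByPosition" from a <params> block like this.
# Verify the exact syntax against the docs for your Tika version.
TIKA_CONFIG = """<?xml version="1.0" encoding="UTF-8"?>
<properties>
  <parsers>
    <parser class="org.apache.tika.parser.DefaultParser">
      <parser-exclude class="org.apache.tika.parser.pdf.PDFParser"/>
    </parser>
    <parser class="org.apache.tika.parser.pdf.PDFParser">
      <params>
        <param name="sortByPosition" type="bool">true</param>
      </params>
    </parser>
  </parsers>
</properties>
"""


def write_tika_config(directory):
    """Write the config into `directory` and return the file path."""
    # Parse once so a malformed config fails here, not inside the Tika server.
    ET.fromstring(TIKA_CONFIG.encode("utf-8"))
    path = os.path.join(directory, "tika_config.xml")
    with open(path, "w", encoding="utf-8") as fh:
        fh.write(TIKA_CONFIG)
    return path


def extract_sorted(pdf_path, config_path):
    """Parse a PDF through tika-python with the custom config (needs Java and tika)."""
    from tika import parser  # imported lazily so write_tika_config works without tika
    return parser.from_file(pdf_path, xmlContent=True, config_path=config_path)
```

One caveat: tika-python appears to pass config_path to the Tika server only when it starts that server itself, so a server that is already running would probably need to be stopped before the new config can take effect. Even with the setting applied, sorting by position may not fix every page, since the result still depends on how the PDF was constructed.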

ggaurav
  • Interested in this as well. Did you ever get a response? – hanreli May 26 '21 at 19:16
  • No, I did not. What I concluded was that it has more to do with how the PDF was originally constructed than with how it looks to us. Given only the PDF, I don't think we can know in advance whether this ordering issue will occur. My finding was that some kind of OCR/computer vision has to be used for such PDFs. I will update here if I find a Tika or other non-computer-vision solution. – ggaurav Jul 28 '21 at 15:01

0 Answers