I am using tika-python to extract text from a PDF. When a page contains multiple tables, the order of the text is not preserved. In my case, the table at the top of the page appears at the end of the text extracted by Tika.
I tried using the following custom config file, but it does not work. I have tried placing the statement <property name="sortByPosition" value="True"/> at various positions, but nothing has worked. I referred to this for the config.xml.
<?xml version="1.0" encoding="UTF-8"?>
<properties>
  <parsers>
    <!-- Default Parser for most things, except for 2 mime types, and never
         use the Executable Parser -->
    <parser class="org.apache.tika.parser.DefaultParser">
      <mime-exclude>image/jpeg</mime-exclude>
      <mime-exclude>application/pdf</mime-exclude>
      <parser-exclude class="org.apache.tika.parser.executable.ExecutableParser"/>
      <!-- property name="sortByPosition" value="True" -->
    </parser>
    <parser class="org.apache.tika.parser.EmptyParser">
      <mime>application/pdf</mime>
      <!-- here? -->
      <!-- this statement is for preserving the order -->
      <property name="sortByPosition" value="True"/>
    </parser>
  </parsers>
</properties>
I use the following code to read the text:
from tika import parser

data = parser.from_file(file_path, xmlContent=True,
                        config_path='/path/to/tika_config.xml')
What am I doing wrong? Is there a way to change the config so that the order is preserved, or is preserving the order not possible?
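To make it easier to see what the config actually assigns, here is a minimal sketch that only inspects the XML with Python's standard library; it does not call Tika at all. The helper names `parsers_for_mime` and `properties_of` are hypothetical, and `CONFIG_XML` is a trimmed inline copy of the config from this question, used only for illustration.

```python
import xml.etree.ElementTree as ET

# Trimmed inline copy of the config from the question (illustration only).
CONFIG_XML = """<?xml version="1.0" encoding="UTF-8"?>
<properties>
  <parsers>
    <parser class="org.apache.tika.parser.DefaultParser">
      <mime-exclude>image/jpeg</mime-exclude>
      <mime-exclude>application/pdf</mime-exclude>
    </parser>
    <parser class="org.apache.tika.parser.EmptyParser">
      <mime>application/pdf</mime>
      <property name="sortByPosition" value="True"/>
    </parser>
  </parsers>
</properties>
"""


def parsers_for_mime(config_xml, mime):
    """Return class names of <parser> entries that list `mime` in a <mime> child."""
    root = ET.fromstring(config_xml.encode("utf-8"))
    matches = []
    for parser_el in root.iter("parser"):
        mimes = [(m.text or "").strip() for m in parser_el.findall("mime")]
        if mime in mimes:
            matches.append(parser_el.get("class"))
    return matches


def properties_of(config_xml, parser_class):
    """Return {name: value} for <property> children of the given parser class."""
    root = ET.fromstring(config_xml.encode("utf-8"))
    props = {}
    for parser_el in root.iter("parser"):
        if parser_el.get("class") == parser_class:
            for prop in parser_el.findall("property"):
                props[prop.get("name")] = prop.get("value")
    return props


if __name__ == "__main__":
    # Show which parser class is assigned to PDFs and which properties it carries.
    for cls in parsers_for_mime(CONFIG_XML, "application/pdf"):
        print(cls, properties_of(CONFIG_XML, cls))
```

With the config as posted, this reports that application/pdf is assigned to org.apache.tika.parser.EmptyParser, which is the entry carrying the sortByPosition property. Whether Tika honors the property in that position is exactly what I am unsure about.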