I want to detect company names in PDFs. I also have all the PDFs in XML format.
Detecting them manually is relatively easy, since these names almost always appear next to the company's address, phone/fax number, etc. They may appear in different positions in the PDF and include slightly different information, but there is definitely a pattern.
I'm having trouble converting this logic into an algorithm. This is an example of what the XML looks like:
<?xml version="1.0" encoding="UTF-8"?>
<Document file="filename.pdf">
<Page number="1" box="0, 0, 1653, 2338" rotated="false" resolution="200">
<Block page="1" block_id="B_1">
<!-- random text on the top part of the pdf -->
<Environment>
<Bottom>B_2</Bottom>
</Environment>
</Block>
<Block page="1" block_id="B_2">
<Properties num_lines="1" box="268, 303, 529, 323">
</Properties>
<Line nr="0">
<Word box="268, 303, 366, 323">company</Word>
<Word box="373, 303, 451, 323">name</Word>
<Word box="459, 303, 529, 323">SA</Word>
</Line>
<Plain_Text>Company Name SA
</Plain_Text>
<Environment>
<Right>B_3</Right>
<Bottom>B_5</Bottom>
</Environment>
</Block>
<Block page="1" block_id="B_3">
<!-- same structure as before but with the address -->
</Block>
<Block page="1" block_id="B_5">
<!-- same structure as before but with the phone number -->
</Block>
</Page>
</Document>
My current idea is to split the algorithm into two steps:
- Find the pattern: the Block elements containing the address, phone number, fax, or any combination of them. Detecting the phone/fax is easy; detecting the address is a whole other problem.
- Find the text on top of these blocks.
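To make the two steps concrete, here is a minimal sketch of the heuristic version using only the Python standard library. It assumes every Block has a Plain_Text child and an Environment child like B_2 in the example; the phone regex, the sample address and phone number, and the function names `load_blocks` and `candidates_above_contact_blocks` are all made up for illustration. Address detection is left out entirely.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical sample mirroring the structure above; the address and phone
# number are invented for illustration.
SAMPLE = """
<Document file="filename.pdf">
  <Page number="1">
    <Block page="1" block_id="B_2">
      <Plain_Text>Company Name SA</Plain_Text>
      <Environment><Right>B_3</Right><Bottom>B_5</Bottom></Environment>
    </Block>
    <Block page="1" block_id="B_3">
      <Plain_Text>12 Example Street, 1000 Sampletown</Plain_Text>
      <Environment></Environment>
    </Block>
    <Block page="1" block_id="B_5">
      <Plain_Text>Tel: +41 21 555 01 23</Plain_Text>
      <Environment></Environment>
    </Block>
  </Page>
</Document>
"""

# Deliberately loose phone/fax pattern (an assumption, not a validated rule):
# a digit, at least six digits/separators, then a closing digit.
PHONE_RE = re.compile(r"\+?\d[\d\s().-]{6,}\d")


def load_blocks(xml_text):
    """Map block_id -> {"text": ..., "neighbours": {"Bottom": "B_5", ...}}."""
    root = ET.fromstring(xml_text.strip())
    blocks = {}
    for block in root.iter("Block"):
        text = (block.findtext("Plain_Text") or "").strip()
        env = block.find("Environment")
        neighbours = {}
        if env is not None:
            neighbours = {child.tag: child.text for child in env}
        blocks[block.get("block_id")] = {"text": text, "neighbours": neighbours}
    return blocks


def candidates_above_contact_blocks(blocks):
    """Step 1: find blocks that look like phone/fax lines.
    Step 2: return the text of blocks whose Bottom neighbour is one of them."""
    contact_ids = {bid for bid, b in blocks.items() if PHONE_RE.search(b["text"])}
    found = []
    for bid, b in blocks.items():
        if bid in contact_ids:
            continue
        if b["neighbours"].get("Bottom") in contact_ids:
            found.append(b["text"])
    return found


if __name__ == "__main__":
    print(candidates_above_contact_blocks(load_blocks(SAMPLE)))
```

This only follows the Bottom link one hop, so it would miss a company name that sits two blocks above the phone number, for example when the address takes a block of its own in between.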
Another complication is that these patterns vary considerably: sometimes there's no phone, sometimes the address spans two lines (and two Block elements), and sometimes there's more text that's not the company name.
This is why I am drawn to using an LSTM architecture (because of the sequential nature of the XML) to detect the Block, but I'm at a loss with the implementation.
I don't know how to feed the data: do I put each block on its own line? It's not a classification problem, since each XML has a different number of blocks, and I don't think it's regression either. I can't use embeddings because I would lose the information in block_id. I also have no clue how to give the architecture the information/heuristics about the address and phone patterns.
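For concreteness, one possible framing (an assumption on my part, not something I have validated) would be to treat each document as a variable-length sequence with one feature vector per Block, and to predict a label per block. The sketch below only builds such a sequence; it does not train anything. It assumes the box attributes are ordered left, top, right, bottom, and the feature choice, the phone regex, the sample phone number, and the name `document_to_sequence` are hypothetical. The block_id is kept alongside the features rather than fed to the model, so predictions could be mapped back to blocks.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical two-block sample; the phone number and B_5's box are invented.
SAMPLE = """<Document file="filename.pdf">
<Page number="1" box="0, 0, 1653, 2338">
<Block page="1" block_id="B_2">
<Properties num_lines="1" box="268, 303, 529, 323"></Properties>
<Plain_Text>Company Name SA</Plain_Text>
</Block>
<Block page="1" block_id="B_5">
<Properties num_lines="1" box="268, 350, 529, 370"></Properties>
<Plain_Text>Tel: +41 21 555 01 23</Plain_Text>
</Block>
</Page>
</Document>"""

# Loose phone/fax pattern, used here as a hand-made heuristic feature.
PHONE_RE = re.compile(r"\+?\d[\d\s().-]{6,}\d")


def parse_box(value):
    """Turn "268, 303, 529, 323" into [268.0, 303.0, 529.0, 323.0]."""
    return [float(v) for v in value.split(",")]


def document_to_sequence(xml_text):
    """Return (block_ids, features): one feature vector per Block, in
    document order, so the sequence length varies with the document."""
    root = ET.fromstring(xml_text)
    ids, features = [], []
    for page in root.iter("Page"):
        # Assumed order: left, top, right, bottom.
        _, _, page_w, page_h = parse_box(page.get("box"))
        for block in page.iter("Block"):
            props = block.find("Properties")
            if props is None:
                continue  # skip blocks without layout information
            left, top, _, _ = parse_box(props.get("box"))
            text = (block.findtext("Plain_Text") or "").strip()
            ids.append(block.get("block_id"))
            features.append([
                left / page_w,                           # relative x position
                top / page_h,                            # relative y position
                float(len(text.split())),                # word count
                1.0 if PHONE_RE.search(text) else 0.0,   # phone heuristic flag
            ])
    return ids, features


if __name__ == "__main__":
    ids, feats = document_to_sequence(SAMPLE)
    for bid, vec in zip(ids, feats):
        print(bid, vec)
```

Under this framing, the heuristics would enter as extra feature columns, and the target would be a 0/1 label per block (1 for the company-name block), which sidesteps the varying number of blocks per XML.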
I have experience with programming and NLP, but not much with XML and XML parsing. Both of my ideas have problems I don't know how to solve, and I don't know which idea (if any) I should pursue, or whether they are well founded at all. Any help is appreciated, thank you!