I want to detect company names in PDFs. I also have all the PDFs in XML format.

Detecting them manually is relatively easy, since these names almost always appear next to the company's address, phone/fax number, etc. They might be in different positions in the PDF and might include slightly different information, but there's definitely a pattern.

I'm having trouble converting this logic into an algorithm. This is an example of what the XML looks like:

<?xml version="1.0" encoding="UTF-8"?>
<Document file="filename.pdf">
  <Page number="1" box="0, 0, 1653, 2338" rotated="false" resolution="200">
    <Block page="1" block_id="B_1">
    <!-- random text on the top part of the pdf -->
    <Environment>
        <Bottom>B_2</Bottom>
    </Environment>
    </Block>
    <Block page="1" block_id="B_2">
    <Properties  num_lines="1" box="268, 303, 529, 323">
    </Properties>
    <Line nr="0">
        <Word box="268, 303, 366, 323">company</Word>
        <Word box="373, 303, 451, 323">name</Word>
        <Word box="459, 303, 529, 323">SA</Word>
    </Line>
    <Plain_Text>Company Name SA
    </Plain_Text>
    <Environment>
    <Right>B_3</Right>
    <Bottom>B_5</Bottom>
    </Environment>
    </Block>
    <Block page="1" block_id="B_3">
    <!-- same structure as before but with the address -->
    </Block>
    <Block page="1" block_id="B_5">
    <!-- same structure as before but with the phone number -->
    </Block>
  </Page>
</Document>

My current idea is to split the algorithm into two steps:

  1. Find the pattern: the Blocks containing the address, phone number, fax, or any combination of them. Detecting the phone/fax is easy; detecting the address is a whole other problem.
  2. Find the text on top of these blocks.

Another complication is that these patterns vary considerably: sometimes there's no phone, sometimes the address spans two lines (and two Blocks), and sometimes there's more text that isn't the company name.

This is why I am drawn to using an LSTM architecture (because of the sequential nature of the XML) to detect the Block, but I'm at a loss with the implementation.

I don't know how to feed in the data: do I put each block on a new line? It's not a classification problem, since each XML has a different number of blocks, and I don't think it's regression either. I can't use embeddings because I would lose the block_id information. I also have no clue how to give the architecture the information/heuristics about the address and phone patterns.

I have experience with programming and NLP, but not much with XML and XML parsing. Both my ideas have problems I don't know how to solve, and I don't know which idea (if any) I should pursue, or whether they are well founded at all. Any help is appreciated, thank you!

Ane

1 Answer

First, to simplify your problem, you could convert your XML into a data structure native to your programming language.

I'll give an example in Python, which is arguably the most popular language for NLP.

Start by converting your XML to a dict with xmltodict:

pip install xmltodict

import xmltodict

xmlDocAsDict = xmltodict.parse(your_xml_file_as_string)

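With this data structure, you can iterate over the blocks like this:

for block in xmlDocAsDict['Document']['Page']['Block']:
    # note: with several pages, 'Page' becomes a list of dicts instead of a single dict
    print(block['@block_id'])  # do something with each block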
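
If you would rather stay in the standard library, the same traversal can be sketched with xml.etree.ElementTree. This is only a minimal sketch: SAMPLE_XML is a made-up, trimmed-down document in the shape of the XML from the question, and iter_block_texts is a hypothetical helper name, not part of any library.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample, trimmed down from the structure shown in the question.
SAMPLE_XML = """<Document file="filename.pdf">
  <Page number="1">
    <Block page="1" block_id="B_1"/>
    <Block page="1" block_id="B_2">
      <Properties num_lines="1" box="268, 303, 529, 323"/>
      <Plain_Text>Company Name SA
      </Plain_Text>
    </Block>
  </Page>
</Document>"""


def iter_block_texts(xml_string):
    """Yield (block_id, text) for every Block with non-empty Plain_Text."""
    root = ET.fromstring(xml_string)
    # iter() walks the whole tree, so this works for any number of pages.
    for block in root.iter("Block"):
        text = (block.findtext("Plain_Text") or "").strip()
        if text:
            yield block.get("block_id"), text


for block_id, text in iter_block_texts(SAMPLE_XML):
    print(block_id, text)
```

Because root.iter("Block") visits every Block in the document, there is no need to special-case documents with one page versus many.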

Now you can run an NER model on the text inside each block to try to find ORG entities. I've tested a couple of NER tools, but the only one that seemed to work well was AllenNLP (see an example here). However, AllenNLP is not very straightforward to use.

So:

pip install allennlp

Then:

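# this helper shipped with the AllenNLP releases current in 2019; newer releases may not include it
from allennlp import pretrained
ner = pretrained.named_entity_recognition_with_elmo_peters_2018()

(The first run may take a while because the model has to be downloaded.)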

Back to the blocks:

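for block in xmlDocAsDict['Document']['Page']['Block']:
    text = block.get('Plain_Text')  # predict() expects a string, not the whole block dict
    if not text:
        continue  # skip blocks without text
    doc = ner.predict(sentence=text)
    # pair each word with its own tag; tags.index(tag) would only ever find the first match
    for word, tag in zip(doc['words'], doc['tags']):
        if tag.endswith('-ORG'):  # BILOU tags: B-ORG, I-ORG, L-ORG, U-ORG
            print(word)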
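
The loop above prints ORG tokens one at a time. To get whole company names, consecutive tagged tokens can be grouped. Here is a rough sketch, assuming the tagger uses BILOU-style tags (B-, I-, L- and U- prefixes). group_org_spans is a hypothetical helper, and the words and tags in the usage example are hand-written for illustration, not real model output.

```python
def group_org_spans(words, tags):
    """Join consecutive ORG-tagged tokens (BILOU scheme) into full names."""
    names, current = [], []
    for word, tag in zip(words, tags):
        if tag.endswith("-ORG"):
            # B- and U- start a new entity, so flush anything still open.
            if tag.startswith(("B-", "U-")) and current:
                names.append(" ".join(current))
                current = []
            current.append(word)
            # L- and U- close the entity.
            if tag.startswith(("L-", "U-")):
                names.append(" ".join(current))
                current = []
        elif current:
            # A non-ORG token ends an entity that was never closed.
            names.append(" ".join(current))
            current = []
    if current:
        names.append(" ".join(current))
    return names


# Hand-written example input (an assumption, not model output).
words = ["Invoice", "from", "Company", "Name", "SA", "and", "Acme"]
tags = ["O", "O", "B-ORG", "I-ORG", "L-ORG", "O", "U-ORG"]
print(group_org_spans(words, tags))
```

With the real predictor you would call it as group_org_spans(doc['words'], doc['tags']).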

I've also tested spaCy, with no good results (not even with a large model). But you can train your own NER model on your data, which should make it more precise. Read more at: https://spacy.io/usage/training/#ner
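
As a rule-based alternative (or a baseline to compare NER against), the two-step idea from the question could be sketched roughly as follows. It rests on several assumptions: that `<Bottom>` in `<Environment>` holds the id of the block directly below (inferred from the sample XML), that PHONE_RE is a good enough phone/fax pattern for your documents (it is a loose guess), and that the address and phone texts in SAMPLE are invented for illustration. blocks_above_phone is a hypothetical helper name.

```python
import re
import xml.etree.ElementTree as ET

# Assumption: a loose phone/fax pattern; tune it for your own documents.
PHONE_RE = re.compile(r"\+?\d[\d\s().-]{6,}\d")

# Made-up sample in the shape of the XML from the question.
SAMPLE = """<Document file="filename.pdf">
  <Page number="1">
    <Block page="1" block_id="B_1">
      <Environment><Bottom>B_2</Bottom></Environment>
    </Block>
    <Block page="1" block_id="B_2">
      <Plain_Text>Company Name SA</Plain_Text>
      <Environment><Right>B_3</Right><Bottom>B_5</Bottom></Environment>
    </Block>
    <Block page="1" block_id="B_3">
      <Plain_Text>Rue de la Gare 12, 1003 Lausanne</Plain_Text>
    </Block>
    <Block page="1" block_id="B_5">
      <Plain_Text>Tel: +41 21 555 01 23</Plain_Text>
    </Block>
  </Page>
</Document>"""


def blocks_above_phone(xml_string):
    """Return the text of blocks whose <Bottom> neighbour looks like a phone block."""
    blocks = list(ET.fromstring(xml_string).iter("Block"))
    text_of = {b.get("block_id"): (b.findtext("Plain_Text") or "").strip()
               for b in blocks}
    # Step 1: anchor blocks, i.e. blocks whose text matches the phone pattern.
    anchors = {bid for bid, text in text_of.items() if PHONE_RE.search(text)}
    # Step 2: blocks that list an anchor as the block directly below them.
    candidates = []
    for b in blocks:
        below = (b.findtext("Environment/Bottom") or "").strip()
        if below in anchors and b.get("block_id") not in anchors:
            candidates.append(text_of[b.get("block_id")])
    return candidates


print(blocks_above_phone(SAMPLE))
```

The candidates it returns could then be filtered with NER, or used to bootstrap training data for a custom model.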

Tiago Duque
  • Thank you! Traversing the XML isn't a problem, I know ET (xml.etree.ElementTree), but thank you for suggesting xmltodict. Yeah, I tried spaCy too and it didn't work very well. Thank you for the AllenNLP link! – Ane Sep 13 '19 at 14:21