
I am trying to read a PDF file and its sections, but I can't find an algorithm or library that does this correctly.

I want to separate the parts of a file (header, abstract, references) and return their contents.

Is there a PDFBox feature or reference that solves this problem?

zx485
Fouad-abdi
  • If you mean extracting tables, no, PDFBox can't do this unless you know exactly where everything is. Maybe tabula can help you, this is on top of PDFBox. – Tilman Hausherr Jan 14 '17 at 20:45
  • For which kinds of pdfs do you want that? I ask because the task becomes more difficult the larger the set of pdfs you have to consider becomes. – mkl Jan 14 '17 at 20:47
  • @mkl I use PDFBox to read papers and to manage a repository of papers. – Fouad-abdi Jan 14 '17 at 21:00
  • Can you share a set of representative PDFs to analyze for patterns to recognize the sections you search? – mkl Jan 14 '17 at 21:02
  • @mkl For example: – Fouad-abdi Jan 14 '17 at 21:04
  • @mkl http://merlot.usc.edu/csac-f06/papers/Wang05a.pdf – Fouad-abdi Jan 14 '17 at 21:04
  • I'll have a look. Another item, I just saw you seem to use the .Net version of pdfbox, at least you tagged your question [tag:c#], but I'm only acquainted with the java version of pdfbox. If I find something, would a java example be OK for you? And which pdfbox version do you use? A 2.0.x or a 1.8.x? – mkl Jan 15 '17 at 10:11
  • @mkl Yes, I can convert Java source to C#. – Fouad-abdi Jan 15 '17 at 12:28

1 Answer


The file provided by the OP as a representative example unfortunately is not tagged. Thus, there is no direct information indicating whether a given piece of text belongs to the title, the abstract, the references, or any other part. As a consequence, there is no sure way to identify such parts, only heuristics (educated guesswork) with a larger or smaller error rate.

In the case of the sample document provided by the OP, identification of the parts can actually be accomplished by simple inspection of the font of the first letter of each line.

The following classes constitute a simple framework for extracting semantic text sections that are recognizable by the characteristics of each line alone, together with a sample of its usage that recognizes the sections in the OP's sample file by inspecting only the font of the first character of each line.

Simple text section extraction framework

As I have only worked with the Java version of PDFBox so far and the OP declared that a Java solution would also be OK, the framework is implemented in Java. It is based on the current development version 2.1.0-SNAPSHOT of PDFBox.

PDFTextSectionStripper

This class constitutes the hub of the framework. It is derived from the PDFBox PDFTextStripper and extends that class by recognition of text sections as configured by a list of TextSectionDefinition instances, see below. Once the PDFTextStripper method getText has been called, the recognized sections are available as a list of TextSection instances, see below.

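import java.io.IOException;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.text.PDFTextStripper;
import org.apache.pdfbox.text.TextPosition;

public class PDFTextSectionStripper extends PDFTextStripper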
{
    //
    // constructor
    //
    public PDFTextSectionStripper(List<TextSectionDefinition> sectionDefinitions) throws IOException
    {
        super();
        
        this.sectionDefinitions = sectionDefinitions;
    }

    //
    // Section retrieval
    //
    /**
     * @return an unmodifiable list of text sections recognized during {@link #getText(PDDocument)}.
     */
    public List<TextSection> getSections()
    {
        return Collections.unmodifiableList(sections);
    }

    //
    // PDFTextStripper overrides
    //
    @Override
    protected void writeLineSeparator() throws IOException
    {
        super.writeLineSeparator();

        if (!currentLine.isEmpty())
        {
            boolean matched = false;
            if (!(currentHeader.isEmpty() && currentBody.isEmpty()))
            {
                TextSectionDefinition definition = sectionDefinitions.get(currentSectionDefinition);
                switch (definition.multiLine)
                {
                case multiLine:
                    if (definition.matchPredicate.test(currentLine))
                    {
                        currentBody.add(new ArrayList<>(currentLine));
                        matched = true;
                    }
                    break;
                case multiLineHeader:
                case multiLineIntro:
                    boolean followUpMatch = false;
                    for (int i = definition.multiple ? currentSectionDefinition : currentSectionDefinition + 1;
                            i < sectionDefinitions.size(); i++)
                    {
                        TextSectionDefinition followUpDefinition = sectionDefinitions.get(i);
                        if (followUpDefinition.matchPredicate.test(currentLine))
                        {
                            followUpMatch = true;
                            break;
                        }
                    }
                    if (!followUpMatch)
                    {
                        currentBody.add(new ArrayList<>(currentLine));
                        matched = true;
                    }
                    break;
                case singleLine:
                    System.out.println("Internal error: There can be no current header or body as long as the current definition is single line only");
                }

                if (!matched)
                {
                    sections.add(new TextSection(definition, currentHeader, currentBody));
                    currentHeader.clear();
                    currentBody.clear();
                    if (!definition.multiple)
                        currentSectionDefinition++;
                }
            }

            if (!matched)
            {
                while (currentSectionDefinition < sectionDefinitions.size())
                {
                    TextSectionDefinition definition = sectionDefinitions.get(currentSectionDefinition);
                    if (definition.matchPredicate.test(currentLine))
                    {
                        matched = true;
                        switch (definition.multiLine)
                        {
                        case singleLine:
                            sections.add(new TextSection(definition, currentLine, Collections.emptyList()));
                            if (!definition.multiple)
                                currentSectionDefinition++;
                            break;
                        case multiLineHeader:
                            currentHeader.addAll(new ArrayList<>(currentLine));
                            break;
                        case multiLine:
                        case multiLineIntro:
                            currentBody.add(new ArrayList<>(currentLine));
                            break;
                        }
                        break;
                    }

                    currentSectionDefinition++;
                }
            }

            if (!matched)
            {
                System.out.println("Could not match line.");
            }
        }
        currentLine.clear();
    }

    @Override
    protected void endDocument(PDDocument document) throws IOException
    {
        super.endDocument(document);

        if (!(currentHeader.isEmpty() && currentBody.isEmpty()))
        {
            TextSectionDefinition definition = sectionDefinitions.get(currentSectionDefinition);
            sections.add(new TextSection(definition, currentHeader, currentBody));
            currentHeader.clear();
            currentBody.clear();
        }
    }

    @Override
    protected void writeString(String text, List<TextPosition> textPositions) throws IOException
    {
        super.writeString(text, textPositions);

        currentLine.add(textPositions);
    }
    
    //
    // member variables
    //
    final List<TextSectionDefinition> sectionDefinitions;

    int currentSectionDefinition = 0;
    final List<TextSection> sections = new ArrayList<>();
    final List<List<TextPosition>> currentLine = new ArrayList<>();

    final List<List<TextPosition>> currentHeader = new ArrayList<>();
    final List<List<List<TextPosition>>> currentBody = new ArrayList<>();
}

(PDFTextSectionStripper.java)

TextSectionDefinition

This class specifies the properties of a text section type: a name, a matching predicate, a MultiLine property, and a multiple occurrences flag.

The name is purely descriptive.

The matching predicate is a function that is given detailed information on the characters on a text line and returns whether this line matches the text section type in question.

The MultiLine property can take one of four different values:

  • singleLine - for sections which consist of a single line only;
  • multiLine - for multiline sections in which each line must match the predicate;
  • multiLineHeader - for multiline sections in which the first line only needs to match the predicate and this first line is a header line;
  • multiLineIntro - for multiline sections in which the first line only needs to match the predicate and this first line is a regular part of the section, probably merely introduced by a special marker word.

The multiple occurrences flag indicates whether there can be multiple instances of this type of text section.

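import java.util.List;
import java.util.function.Predicate;

import org.apache.pdfbox.text.TextPosition;

public class TextSectionDefinition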
{
    public enum MultiLine
    {
        singleLine,         // A single line without text body, e.g. title
        multiLine,          // Multiple lines, all match predicate, e.g. emails  
        multiLineHeader,    // Multiple lines, first line matches as header, e.g. h1
        multiLineIntro      // Multiple lines, first line matches inline, e.g. abstract
    }

    public TextSectionDefinition(String name, Predicate<List<List<TextPosition>>> matchPredicate, MultiLine multiLine, boolean multiple)
    {
        this.name = name;
        this.matchPredicate = matchPredicate;
        this.multiLine = multiLine;
        this.multiple = multiple;
    }

    final String name;
    final Predicate<List<List<TextPosition>>> matchPredicate;
    final MultiLine multiLine;
    final boolean multiple;
}

(TextSectionDefinition.java)
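
To make the predicate contract more tangible, here is a minimal sketch that does not depend on PDFBox. The Glyph class is a hypothetical stand-in for TextPosition that carries only a character and a font name, and the font names used in main (including the "ABCDEF+" subset prefix) are made up for illustration. It shows a reusable predicate factory that looks only at the first glyph of a line and, unlike a bare get(0).get(0), does not throw on an empty line.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.Predicate;

class FirstGlyphFontPredicateSketch
{
    // Hypothetical stand-in for PDFBox's TextPosition: one glyph and the name of its font.
    static final class Glyph
    {
        final String unicode;
        final String fontName;

        Glyph(String unicode, String fontName)
        {
            this.unicode = unicode;
            this.fontName = fontName;
        }
    }

    // A line is a list of words, a word is a list of glyphs.
    // The predicate matches if the font name of the line's first glyph contains the marker;
    // empty lines and lines whose first word has no glyph never match.
    static Predicate<List<List<Glyph>>> firstGlyphFontContains(String marker)
    {
        return line -> !line.isEmpty()
                && !line.get(0).isEmpty()
                && line.get(0).get(0).fontName.contains(marker);
    }

    // Helper for the demo: one word whose glyphs all use the same font.
    static List<Glyph> word(String text, String fontName)
    {
        List<Glyph> glyphs = new ArrayList<>();
        for (char c : text.toCharArray())
            glyphs.add(new Glyph(String.valueOf(c), fontName));
        return glyphs;
    }

    public static void main(String[] args)
    {
        Predicate<List<List<Glyph>>> boldTwelve = firstGlyphFontContains("CMBX12");

        List<List<Glyph>> titleLine = Arrays.asList(
                word("How", "ABCDEF+CMBX12"), word("to", "ABCDEF+CMBX12"));
        List<List<Glyph>> bodyLine = Arrays.asList(
                word("People", "ABCDEF+CMR10"), word("know", "ABCDEF+CMR10"));

        System.out.println(boldTwelve.test(titleLine)); // true
        System.out.println(boldTwelve.test(bodyLine));  // false
    }
}
```

With the real framework, the same idea would presumably be written against TextPosition and its getFont().getName() call, as in the example use further down.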

TextSection

This class represents a text section recognized by this framework.

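import java.text.Normalizer;
import java.text.Normalizer.Form;
import java.util.ArrayList;
import java.util.List;

import org.apache.pdfbox.text.TextPosition;

public class TextSection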
{
    public TextSection(TextSectionDefinition definition, List<List<TextPosition>> header, List<List<List<TextPosition>>> body)
    {
        this.definition = definition;
        this.header = new ArrayList<>(header);
        this.body = new ArrayList<>(body);
    }

    @Override
    public String toString()
    {
        StringBuilder stringBuilder = new StringBuilder();
        stringBuilder.append(definition.name).append(": ");
        if (!header.isEmpty())
            stringBuilder.append(toString(header));
        stringBuilder.append('\n');
        for (List<List<TextPosition>> bodyLine : body)
        {
            stringBuilder.append("    ").append(toString(bodyLine)).append('\n');
        }
        return stringBuilder.toString();
    }

    String toString(List<List<TextPosition>> words)
    {
        StringBuilder stringBuilder = new StringBuilder();
        boolean first = true;
        for (List<TextPosition> word : words)
        {
            if (first)
                first = false;
            else
                stringBuilder.append(' ');
            for (TextPosition textPosition : word)
            {
                stringBuilder.append(textPosition.getUnicode());
            }
        }
        // cf. https://stackoverflow.com/a/7171932/1729265
        return Normalizer.normalize(stringBuilder, Form.NFKC);
    }

    final TextSectionDefinition definition;
    final List<List<TextPosition>> header;
    final List<List<List<TextPosition>>> body;
}

(TextSection.java)

Concerning the Normalizer.normalize(stringBuilder, Form.NFKC) call, cf. this answer to the Stack Overflow question "Separating Unicode ligature characters" (linked in the code comment above).
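
As a small standalone illustration of what that normalization does, the following sketch uses only the JDK. The class name and the sample strings are made up for the demo. Extracted PDF text often contains single-character ligatures such as U+FB01 ("fi"), and NFKC replaces them with their plain letters.

```java
import java.text.Normalizer;
import java.text.Normalizer.Form;

class LigatureNormalizationDemo
{
    // Applies NFKC, which replaces compatibility characters such as ligatures by their plain letters.
    static String normalize(CharSequence extracted)
    {
        return Normalizer.normalize(extracted, Form.NFKC);
    }

    public static void main(String[] args)
    {
        // U+FB01 is the single-character ligature "fi", U+FB00 is the ligature "ff"
        System.out.println(normalize("\uFB01nding a collision")); // finding a collision
        System.out.println(normalize("di\uFB00erential"));        // differential
    }
}
```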

Example use

One can use this framework with very simple matching predicates to recognize the sections in the representative sample provided by the OP:

List<TextSectionDefinition> sectionDefinitions = Arrays.asList(
        new TextSectionDefinition("Titel", x->x.get(0).get(0).getFont().getName().contains("CMBX12"), MultiLine.singleLine, false),
        new TextSectionDefinition("Authors", x->x.get(0).get(0).getFont().getName().contains("CMR10"), MultiLine.multiLine, false),
        new TextSectionDefinition("Institutions", x->x.get(0).get(0).getFont().getName().contains("CMR9"), MultiLine.multiLine, false),
        new TextSectionDefinition("Addresses", x->x.get(0).get(0).getFont().getName().contains("CMTT9"), MultiLine.multiLine, false),
        new TextSectionDefinition("Abstract", x->x.get(0).get(0).getFont().getName().contains("CMBX9"), MultiLine.multiLineIntro, false),
        new TextSectionDefinition("Section", x->x.get(0).get(0).getFont().getName().contains("CMBX12"), MultiLine.multiLineHeader, true)
        );

PDDocument document = PDDocument.load(resource);
PDFTextSectionStripper stripper = new PDFTextSectionStripper(sectionDefinitions);
stripper.getText(document);

System.out.println("Sections:");
List<String> texts = new ArrayList<>();
for (TextSection textSection : stripper.getSections())
{
    String text = textSection.toString();
    System.out.println(text);
    texts.add(text);
}
Files.write(new File(RESULT_FOLDER, "Wang05a.txt").toPath(), texts);

(ExtractTextSections.java test method testWang05a)

The shortened result:

Titel: How to Break MD5 and Other Hash Functions

Authors: 
    Xiaoyun Wang and Hongbo Yu

Institutions: 
    Shandong University, Jinan 250100, China,

Addresses: 
    xywang@sdu.edu.cn, yhb@mail.sdu.edu.cn

Abstract: 
    Abstract. MD5 is one of the most widely used cryptographic hash func-
    tions nowadays. It was designed in 1992 as an improvement of MD4, and
    ...

Section: 1 Introduction
    People know that digital signatures are very important in information security.
    The security of digital signatures depends on the cryptographic strength of the
    ...

Section: 2 Description of MD5
    In order to conveniently describe the general structure of MD5, we first recall
    the iteration process for hash functions.
    ...

Section: 3 Differential Attack for Hash Functions
    3.1 The Modular Differential and the XOR Differential
    The most important analysis method for hash functions is differential attack
    ...

Section: 4 Differential Attack on MD5
    4.1 Notation
    Before presenting our attack, we first introduce some notation to simplify the
    ...

Section: 5 Summary
    In this paper we described a powerful attack against hash functions, and in
    particular showed that finding a collision of MD5 is easily feasible.
    ...

Section: Acknowledgements
    It is a pleasure to acknowledge Dengguo Feng for the conversations that led to
    this research on MD5. We would like to thank Eli Biham, Andrew C. Yao, and
    ...

Section: References
    1. E. Biham, A. Shamir. Differential Cryptanalysis of the Data Encryption Standard,
    Springer-Verlag, 1993.
    ...

For more generic text section recognition, one obviously cannot count on these specific TeX fonts being used to signal a specific text section. Instead one might have to look at font sizes (remember not to take the plain font size attribute but to scale it according to the transformation and text matrices!), alignment, etc. One probably needs to scan the document first to determine common text sizes etc.
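
A rough sketch of such a pre-scan, again without PDFBox: the Line class is a hypothetical stand-in for an extracted line and is assumed to carry the already scaled, effective size of its first glyph in points. The sample sizes and the threshold ratio of 1.15 are arbitrary illustration values, not measurements from the sample file. The size that covers the most characters is taken as the body text size, and lines that are clearly larger become heading candidates.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class FontSizeHeuristicSketch
{
    // Hypothetical stand-in for an extracted line: its text and the effective size of its first glyph.
    static final class Line
    {
        final String text;
        final float size;

        Line(String text, float size)
        {
            this.text = text;
            this.size = size;
        }
    }

    // Rounds to half points so that tiny numeric differences end up in the same bucket.
    static float roundToHalfPoint(float size)
    {
        return Math.round(size * 2) / 2f;
    }

    // Returns the size bucket covering the most characters, assumed to be the body text size.
    // Returns 0 for an empty list; ties between buckets are resolved arbitrarily.
    static float dominantSize(List<Line> lines)
    {
        Map<Float, Integer> charsPerSize = new HashMap<>();
        for (Line line : lines)
            charsPerSize.merge(roundToHalfPoint(line.size), line.text.length(), Integer::sum);

        float best = 0;
        int bestCount = -1;
        for (Map.Entry<Float, Integer> entry : charsPerSize.entrySet())
        {
            if (entry.getValue() > bestCount)
            {
                best = entry.getKey();
                bestCount = entry.getValue();
            }
        }
        return best;
    }

    // Returns the texts of all lines whose size is at least minRatio times the body text size.
    static List<String> headingCandidates(List<Line> lines, float minRatio)
    {
        float bodySize = dominantSize(lines);
        List<String> result = new ArrayList<>();
        for (Line line : lines)
        {
            if (roundToHalfPoint(line.size) >= bodySize * minRatio)
                result.add(line.text);
        }
        return result;
    }

    public static void main(String[] args)
    {
        List<Line> lines = Arrays.asList(
                new Line("How to Break MD5 and Other Hash Functions", 14.3f),
                new Line("1 Introduction", 12f),
                new Line("People know that digital signatures are very important in information security.", 10f),
                new Line("The security of digital signatures depends on the cryptographic strength of the", 10f));

        System.out.println(dominantSize(lines));             // 10.0
        System.out.println(headingCandidates(lines, 1.15f)); // the title line and "1 Introduction"
    }
}
```

In a real document this alone will not be enough, since captions, footnotes and formulas also deviate from the body size, so it would have to be combined with other signals such as alignment or font weight.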

In case of multiple documents published in the same magazine, though, recognition predicates might actually be as simple as in the example above because in such situations authors often have to stick to very specific layout and format rules.

mkl