
I am trying to read PDF documents and need to split them into sections based on the header font size, or on font and font size together. I currently have it implemented based on the answer to this post. But because my PDF uses the same font for headers and sub-headers, I need to modify the code so that it matches on font size, or on both font and font size.

    List<TextSectionDefinition> sectionDefinitions = Arrays.asList(
            new TextSectionDefinition("Section", x -> x.get(0).get(0).getFont().getName().contains("Calibri,Bold"), TextSectionDefinition.MultiLine.multiLineHeader, true)
    );

    document.getClass();
    PDFTextSectionStripper stripper = new PDFTextSectionStripper(sectionDefinitions);
    stripper.getText(document);

    System.out.println("Sections:");
    List<String> texts = new ArrayList<>();
    for (TextSection textSection : stripper.getSections()) {
        String text = textSection.toString();
        System.out.println(text);
        texts.add(text);
    }

    return ResponseEntity.ok(texts);

My problem is that if I try to use getFontSize instead of getFont it doesn't allow any parameters to be entered, in my case 16 (font size).

halfer

1 Answer

In the answer you refer to there are text section definitions like this:

new TextSectionDefinition("Titel",
    x->x.get(0).get(0).getFont().getName().contains("CMBX12"),
    MultiLine.singleLine,
    false)

I assume your remark

if I try to use getFontSize instead of getFont it doesn't allow any parameters to be entered, in my case 16

indicates that you want to exchange the lambda expression in the second parameter

x->x.get(0).get(0).getFont().getName().contains("CMBX12")

with something that tests the font size. Have you tried replacing it with

x->x.get(0).get(0).getFontSize() == 16

or

x->x.get(0).get(0).getFontSizeInPt() == 16

or

x-> {
    float size = x.get(0).get(0).getFontSizeInPt();
    return size > 15 && size < 17;
}

yet?
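If you need to test font and size together, the two checks can be combined into one predicate. Below is a minimal sketch of that idea, not code from the referenced answer. `Glyph` is a hypothetical stand-in for PDFBox's `TextPosition` so that the example runs without PDFBox on the classpath, and the font names and sizes are made-up sample values. In your real lambda you would read the values from `x.get(0).get(0).getFont().getName()` and `x.get(0).get(0).getFontSizeInPt()` instead. A tolerance is used rather than `==` because the computed size is a `float` and may not be exactly 16.

```java
import java.util.List;
import java.util.function.Predicate;

public class HeaderMatch {
    // Hypothetical stand-in for PDFBox's TextPosition, reduced to the
    // two values the predicate inspects.
    static final class Glyph {
        final String fontName;
        final float fontSizeInPt;

        Glyph(String fontName, float fontSizeInPt) {
            this.fontName = fontName;
            this.fontSizeInPt = fontSizeInPt;
        }
    }

    // True if the first glyph of the first line uses a font whose name
    // contains the given fragment.
    static Predicate<List<List<Glyph>>> fontContains(String namePart) {
        return x -> x.get(0).get(0).fontName.contains(namePart);
    }

    // True if the first glyph's size lies within +/- tolerance of expected.
    static Predicate<List<List<Glyph>>> sizeNear(float expected, float tolerance) {
        return x -> Math.abs(x.get(0).get(0).fontSizeInPt - expected) <= tolerance;
    }

    public static void main(String[] args) {
        // Header = bold Calibri AND roughly 16 pt.
        Predicate<List<List<Glyph>>> header =
                fontContains("Calibri,Bold").and(sizeNear(16f, 0.5f));

        // Made-up sample lines: same font, different sizes.
        List<List<Glyph>> headerLine =
                List.of(List.of(new Glyph("ABCDEF+Calibri,Bold", 15.96f)));
        List<List<Glyph>> subHeaderLine =
                List.of(List.of(new Glyph("ABCDEF+Calibri,Bold", 12f)));

        System.out.println(header.test(headerLine));    // prints true
        System.out.println(header.test(subHeaderLine)); // prints false
    }
}
```

In PDFBox, `getFontSize()` and `getFontSizeInPt()` can return different values when the PDF applies the size through the text matrix rather than the font operator, so it may be worth printing both for a known header line before settling on a threshold.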

mkl
  • I have tested your suggested solutions, but the console fills with "Could not match line.", even when I tried size intervals from 1 to 80 to check whether it would find any matches; it didn't. I also tried using the font and the font size in the same section definition, but the output was still the same. Could it have something to do with the PDF file itself, hopefully something I could modify to make it work as needed? (FYI, the PDF file is not secured.) – Vytautas Jun 16 '20 at 10:27
  • @Vytautas Please share the PDF in question for analysis. – mkl Jun 16 '20 at 15:46
  • Unfortunately the PDF is confidential and I cannot share it. Could you please provide some pointers on what I should check in the PDF? – Vytautas Jun 17 '20 at 06:40