I am trying to convert a PDF file into an Excel file using PDFBox and Apache POI:

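import java.io.File;
import java.io.FileOutputStream;

import org.apache.pdfbox.Loader;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.text.PDFTextStripper;
import org.apache.poi.ss.usermodel.Cell;
import org.apache.poi.ss.usermodel.Row;
import org.apache.poi.xssf.usermodel.XSSFSheet;
import org.apache.poi.xssf.usermodel.XSSFWorkbook;

// try-with-resources closes the PDF, the workbook and the output stream,
// even if an exception is thrown part-way through
try (PDDocument document = Loader.loadPDF(new File("Example.pdf"));
     XSSFWorkbook workbook = new XSSFWorkbook();
     FileOutputStream fileOut = new FileOutputStream("outputFile.xlsx")) {

    // Extract the text from the PDF
    PDFTextStripper stripper = new PDFTextStripper();
    String text = stripper.getText(document);

    // Create a new sheet in the workbook
    XSSFSheet sheet = workbook.createSheet("PDF Data");

    // Split the text into rows (on any line break) and columns (on tabs)
    String[] rows = text.split("\\R");
    for (int i = 0; i < rows.length; i++) {
        String[] cols = rows[i].split("\t");
        Row row = sheet.createRow(i);
        for (int j = 0; j < cols.length; j++) {
            Cell cell = row.createCell(j);
            cell.setCellValue(cols[j]);
        }
    }

    // Save the Excel file
    workbook.write(fileOut);
}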

My PDF file looks like a table.

But this code does not put the data into the correct columns. Instead of

|   name   |    some data   | some data 2| some data 3 |    some data 4    |  some data 5 |
+----------+----------------+------------+-------------+-------------------+--------------+

the code inserts all the data into a single column:

name
some data
some data 2
some data 3
some data 4
some data 5
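
The single-column result can be reproduced without a PDF. This is a minimal sketch, assuming the extracted text contains each cell on its own line and no tab characters, which matches the output above. The `SplitDemo` class and the `splitCells` helper are hypothetical names used only for illustration.

```java
import java.util.ArrayList;
import java.util.List;

class SplitDemo {
    // One row per line, one cell per tab-separated piece,
    // as in the loop in the question.
    static List<String[]> splitCells(String text) {
        List<String[]> rows = new ArrayList<>();
        for (String line : text.split("\\R")) {
            rows.add(line.split("\t"));
        }
        return rows;
    }

    public static void main(String[] args) {
        // Assumed shape of the extracted text: every cell on its own line, no tabs.
        String text = "name\nsome data\nsome data 2\n";
        List<String[]> rows = splitCells(text);
        // prints "3 rows, 1 column(s)"
        System.out.println(rows.size() + " rows, " + rows.get(0).length + " column(s)");
    }
}
```

Because there is no tab to split on, every line yields exactly one cell, so each table cell ends up in its own row of the first column.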

Please help me get the data distributed across several columns instead of one.

The underlying problem is that I then need to write the data from the Excel file to a database.

Alexander
  • `PDFTextStripper` will ignore all formatting, as stated in its JavaDoc. That includes things like "place that string right of that other string". In other words: if you care about how text is arranged and it's not just straight-up prose, then this is probably the wrong tool for the job. – Joachim Sauer Jan 24 '23 at 11:33
  • @Joachim Sauer For me it is very important that the data is displayed in all columns. Please tell me the correct solution – Alexander Jan 24 '23 at 11:36
  • Check the duplicate linked at the top of your question, it contains multiple viable approaches. – Joachim Sauer Jan 24 '23 at 11:36
  • @Joachim Sauer I already tried that. It doesn't work for me because that code removes empty cells, and the resulting document structure is not correct. Moreover, I subsequently write the data from the generated Excel file to the database – Alexander Jan 24 '23 at 11:40
  • How can I write data to the database through the code that you recommended to me? – Alexander Jan 24 '23 at 11:43
  • That question has at least a dozen answers, most of which are upvoted. I don't quite believe that you've tried *all of them* (or even more than one). Also your last comment is a total non-sequitur and doesn't seem related to this question at all. – Joachim Sauer Jan 24 '23 at 11:51
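
The point raised in the comments, that `PDFTextStripper` discards where strings sit relative to each other, suggests assigning cells by position instead of splitting on tabs. The following is only a sketch of that idea, not a tested solution for this PDF. `ColumnGrouper`, `Fragment`, `columnIndex`, `toRows`, the column start offsets and the row tolerance are all hypothetical, and y is assumed to increase down the page. With PDFBox, the text and coordinates could probably be collected by subclassing `PDFTextStripper`, overriding `writeString(String, List<TextPosition>)`, and reading `getXDirAdj()` and `getYDirAdj()` from the `TextPosition` objects; the sketch itself uses plain Java so that it runs without the library.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class ColumnGrouper {
    // Hypothetical stand-in for a piece of text plus its page coordinates.
    static final class Fragment {
        final String text;
        final float x;
        final float y;

        Fragment(String text, float x, float y) {
            this.text = text;
            this.x = x;
            this.y = y;
        }
    }

    // Index of the last column whose left edge is at or before x.
    // columnStarts must be sorted in ascending order.
    static int columnIndex(float x, float[] columnStarts) {
        int col = 0;
        for (int i = 0; i < columnStarts.length; i++) {
            if (x >= columnStarts[i]) {
                col = i;
            }
        }
        return col;
    }

    // Groups fragments into rows by y, then places each one in a cell by x.
    // Cells that receive no fragment stay "", so empty cells are preserved.
    static List<String[]> toRows(List<Fragment> fragments, float[] columnStarts, float rowTolerance) {
        List<Fragment> sorted = new ArrayList<>(fragments);
        sorted.sort((a, b) -> {
            int byY = Float.compare(a.y, b.y);
            return byY != 0 ? byY : Float.compare(a.x, b.x);
        });

        List<String[]> rows = new ArrayList<>();
        String[] current = null;
        float currentY = 0f;
        for (Fragment f : sorted) {
            // Start a new row when the fragment is too far below the row's first fragment.
            if (current == null || Math.abs(f.y - currentY) > rowTolerance) {
                current = new String[columnStarts.length];
                Arrays.fill(current, "");
                rows.add(current);
                currentY = f.y;
            }
            int col = columnIndex(f.x, columnStarts);
            current[col] = current[col].isEmpty() ? f.text : current[col] + " " + f.text;
        }
        return rows;
    }

    public static void main(String[] args) {
        // Made-up coordinates; the third column is left empty on purpose.
        List<Fragment> page = List.of(
                new Fragment("name", 10f, 100f),
                new Fragment("some data", 80f, 100f),
                new Fragment("some data 3", 260f, 100f));
        float[] columnStarts = {0f, 70f, 170f, 250f};
        for (String[] row : toRows(page, columnStarts, 2f)) {
            // prints "name|some data||some data 3"
            System.out.println(String.join("|", row));
        }
    }
}
```

Each `String[]` returned by `toRows` has one entry per column, so it could be written to a sheet with `row.createCell(j).setCellValue(cells[j])`, and a missing value keeps its column position instead of shifting the cells after it.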
