I have tried most of the suggestions on Stack Overflow and elsewhere.
Problem: I have a PDF that contains both regular text and tables. I need to parse the tables as well as the surrounding text.
API:
https://github.com/tabulapdf/tabula-java
I am using tabula-java, which skips some of the content, and the contents of table cells are not separated properly.
My PDF has content like this:
DATE :1/1/2018 ABCD SCODE:FFFT
--ACCEPTED--
USER:ADMIN BATCH:RR EEE
CON BATCH
=======================================================================
MAIN SNO SUB VALUE DIS %
R 12 rr1 0125 24.5
SLNO DESC QTY TOTAL CODE FREE
1 ABD 12 90 BBNEW -NILL-
2 XDF 45 55 GHT55 MRP
3 QWE 08 77 CAT -NILL-
=======================================================================
MAIN SNO SUB VALUE DIS %
QW 14 rr2 0122 24.5
SLNO DESC QTY TOTAL CODE FREE
1 ABD 12 90 BBNEW -NILL-
2 XDF 45 55 GHT55 MRP
3 QWE 08 77 CAT -NILL-
Tabula code to convert:
    public static void toCsv() throws ParseException {
        // -p 1: process page 1 only; -o: output file option
        String[] commandLineOptions = { "-p", "1", "-o", "$csv" };
        CommandLineParser parser = new DefaultParser();
        try {
            CommandLine line = parser.parse(TabulaUtil.buildOptions(), commandLineOptions);
            // Read the PDF and write the extracted tables to the CSV file
            new TabulaUtil(System.out, line).extractFileInto(
                    new File("/home/sample/firstPage.pdf"),
                    new File("/home/sample/onePage.csv"));
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
Tabula also provides a command-line interface:
    java -jar TabulaJar/tabula-1.0.2-jar-with-dependencies.jar -p all -o $csv -b Pdfs
I have tried tabula's -c,--columns <COLUMNS> option, which splits cells at the given X coordinates of the column boundaries.
But the problem is that my PDFs' content is dynamic, i.e. the table sizes change, so fixed coordinates do not fit every file.
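To illustrate what "dynamic" column detection could look like, here is a minimal sketch in plain Java. It is not part of tabula. It assumes the table text is available as lines whose columns are aligned with spaces (as in a monospaced layout), and it treats every character position that is blank in all rows as a gap between columns. The class and method names (ColumnGuesser, columnStarts, split) are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

public class ColumnGuesser {

    // Returns the character positions where a column starts: a position that
    // holds text in at least one row and directly follows a position that is
    // blank in every row (or is the first position of the line).
    public static List<Integer> columnStarts(List<String> rows) {
        int width = 0;
        for (String row : rows) {
            width = Math.max(width, row.length());
        }
        boolean[] used = new boolean[width];
        for (String row : rows) {
            for (int i = 0; i < row.length(); i++) {
                if (row.charAt(i) != ' ') {
                    used[i] = true;
                }
            }
        }
        List<Integer> starts = new ArrayList<>();
        for (int i = 0; i < width; i++) {
            if (used[i] && (i == 0 || !used[i - 1])) {
                starts.add(i);
            }
        }
        return starts;
    }

    // Cuts one row at the given column starts and trims each cell.
    public static List<String> split(String row, List<Integer> starts) {
        List<String> cells = new ArrayList<>();
        for (int c = 0; c < starts.size(); c++) {
            int from = Math.min(starts.get(c), row.length());
            int to = c + 1 < starts.size()
                    ? Math.min(starts.get(c + 1), row.length())
                    : row.length();
            cells.add(row.substring(from, to).trim());
        }
        return cells;
    }

    public static void main(String[] args) {
        // Assumed input: rows padded with spaces so that columns line up.
        List<String> rows = List.of(
                "SLNO DESC QTY TOTAL CODE  FREE",
                "1    ABD  12  90    BBNEW -NILL-",
                "2    XDF  45  55    GHT55 MRP");
        List<Integer> starts = columnStarts(rows);
        for (String row : rows) {
            System.out.println(String.join(",", split(row, starts)));
        }
    }
}
```

Because the boundaries are recomputed from each table's own rows, nothing is hard-coded per file. The approach breaks down when a cell value contains an interior space at a position where every other row is also blank, or when the extracted text does not preserve alignment.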
These Stack Overflow questions, and many more, did not work for me:
How to convert PDF to CSV with tabula-py?
How to extract table data from PDF as CSV from the command line?
How to convert a pdf file into CSV file?
Parse PDF table and display it as CSV(Java)
I have also used PDFBox, which gives unformatted text from which I cannot read the table content properly.
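For reference, this is roughly the kind of post-processing that plain extracted text would need. The sketch below uses only the Java standard library and takes the text lines as input, however they were obtained. It assumes that no cell value contains a space, which already fails for a header such as "DIS %" in my sample. The names TextToCsv, toCsv and escape are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

public class TextToCsv {

    // Quotes a cell when it contains a comma, a quote or a line break.
    public static String escape(String cell) {
        if (cell.contains(",") || cell.contains("\"") || cell.contains("\n")) {
            return "\"" + cell.replace("\"", "\"\"") + "\"";
        }
        return cell;
    }

    // Converts extracted text lines to CSV lines, one per non-empty input line.
    public static List<String> toCsv(List<String> lines) {
        List<String> csv = new ArrayList<>();
        for (String line : lines) {
            String trimmed = line.trim();
            // Skip blank lines and separator rules such as "=====".
            if (trimmed.isEmpty() || trimmed.matches("=+")) {
                continue;
            }
            List<String> cells = new ArrayList<>();
            // Assumption: cells are separated by whitespace and contain none.
            for (String token : trimmed.split("\\s+")) {
                cells.add(escape(token));
            }
            csv.add(String.join(",", cells));
        }
        return csv;
    }

    public static void main(String[] args) {
        List<String> lines = List.of(
                "====================",
                "SLNO DESC QTY TOTAL CODE FREE",
                "1 ABD 12 90 BBNEW -NILL-",
                "2 XDF 45 55 GHT55 MRP");
        toCsv(lines).forEach(System.out::println);
    }
}
```

This works for the SLNO rows of the sample, but it cannot tell an empty cell from a missing one, which is why I am looking for a proper table-aware solution.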
Is it possible to convert a PDF with tables to CSV/Excel using Java without losing content and formatting?
I don't want to use paid libraries.