
This question has been asked before, but my specific query was not answered. I have a PDF with a table in which some columns have no values, and I need to read those blank cells. I am using iText to extract data from the PDF, but the table is read column by column, and a column with no value is not returned as white space; the next column's value is returned instead. I have customized LocationTextExtractionStrategy and overridden getResultantText(). In the image below, rows 1, 2 and 3 have no values in the MD and TD columns, so when reading the PDF, after 1 I do not get spaces but the next value, which is 2. Is there any way to read the blank cells?

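import com.itextpdf.text.pdf.PdfReader;
import com.itextpdf.text.pdf.parser.FilteredTextRenderListener;
import com.itextpdf.text.pdf.parser.PdfTextExtractor;
import com.itextpdf.text.pdf.parser.TextExtractionStrategy;

// FontRenderFilter and MyLocationTextExtractionStrategy are my own classes.
PdfReader reader = new PdfReader(filename);
FontRenderFilter fontFilter = new FontRenderFilter();

for (int i = 1; i <= reader.getNumberOfPages(); i++) {
    // Use a fresh strategy per page: a location-based strategy keeps the chunks
    // it has collected, so reusing one instance repeats text from earlier pages.
    TextExtractionStrategy strategy = new FilteredTextRenderListener(new MyLocationTextExtractionStrategy(), fontFilter);
    String finalText = PdfTextExtractor.getTextFromPage(reader, i, strategy);

    System.out.println("finalText.." + finalText);
}
reader.close();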

PDF table to be extracted

Amedee Van Gasse
  • Have you already tried the strategy proposed in [this answer](https://stackoverflow.com/a/24911617/1729265)? – mkl Mar 28 '18 at 06:02
  • Yes, I have tried that example; here it is creating spaces. I tried to override getResultantText(), but while reading the chunks, after reading the value 1 from the PDF table it reads the next value as 2 and not as white space. – romana parween Mar 28 '18 at 06:30
  • How can I read the white space I have after 1 in the MD and TD columns? Please help. – romana parween Mar 28 '18 at 06:31
  • When you talk about whitespace, what do you mean? Are you referring to characters such as ` `, `\n`, `\t`? In that case your question is wrong. For instance: the `\t` character isn't a tab in PDF syntax. In a PDF, text is added at absolute positions, using coordinates on a page. When I look at your screen shot, I don't see whitespace characters (and if you're completely honest: neither do you). What we see, is text added at different x-offsets. Because of the different x-offsets, there is space between the words. – Bruno Lowagie Mar 28 '18 at 07:18
  • *"how can I read the white space I have for after 1 in MD and TD column."* ... *"I need this value"* - There are no white spaces. So you cannot read them. There is no value. So you can not need it. There only are some border lines, that's it. – mkl Mar 28 '18 at 08:35
  • The table is MD TD 1 2. I need the value coming after 1 in the MD and TD columns. Here it is "", meaning no value (blank), which the PDF extractor is not able to read. Can I at least check whether the row with value "1" has any value or not? – romana parween Mar 28 '18 at 08:45
  • Can I get it using the x-offset value? I need to check whether each row has a value for both the MD and TD columns, and store null wherever a row has no value. – romana parween Mar 28 '18 at 08:48
  • At least let me know if it is possible to read the empty value "" using iText or not. – romana parween Mar 28 '18 at 09:43
  • *"let me know if it is possible to read the empty value "" using iText or not."* - The text extraction strategy I linked to above, just like the standard iText extraction strategies, will return a `'\n'` after each of the `1`, the `2`, and the `3`. You expect MD and TD values after each number on the same line. You can, therefore, interpret the fact that there is nothing between the number and the `'\n'` to mean that the MD and TD values are empty. If your *customized `LocationTextExtractionStrategy`* does not return those `'\n'` characters, those customizations probably aren't good for your use case. – mkl Mar 28 '18 at 10:39
  • I am getting `'\n'` if there is no value between 1 and 2. But if MD is blank and TD has a value, then while reading the PDF, how do I know which column the value belongs to? – romana parween Mar 28 '18 at 11:15
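The x-offset idea raised in the comments above can be sketched in plain Java. This is only an illustration under stated assumptions, not an iText API: the `Chunk` type, the `ColumnAssigner` class, and the column boundaries are all hypothetical. It assumes each text chunk's start coordinates have already been collected (in iText 5 a render listener could probably read them via `renderInfo.getBaseline().getStartPoint()`, using `Vector.I1` for x and `Vector.I2` for y) and that the left edge of every column is known, for example measured from the header row. A cell that receives no chunk stays `null`, which is how an empty MD or TD value would show up.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

class ColumnAssigner {

    /** Hypothetical holder for a piece of extracted text and its start position. */
    public static final class Chunk {
        final String text;
        final float x;
        final float y;

        public Chunk(String text, float x, float y) {
            this.text = text;
            this.x = x;
            this.y = y;
        }
    }

    /**
     * Groups chunks into rows by rounded y-coordinate and puts each chunk into
     * the column whose x-range contains it. Cells without any chunk stay null.
     *
     * @param columnStarts ascending left x-edges of the columns (assumed known)
     */
    public static List<String[]> toRows(List<Chunk> chunks, float[] columnStarts) {
        // PDF y-coordinates grow upward, so descending order lists rows top to bottom.
        Map<Integer, String[]> rows = new TreeMap<Integer, String[]>(Comparator.reverseOrder());
        for (Chunk c : chunks) {
            int rowKey = Math.round(c.y);
            String[] row = rows.computeIfAbsent(rowKey, k -> new String[columnStarts.length]);
            int col = columnIndex(c.x, columnStarts);
            // Several chunks may fall into one cell; join them in input order.
            row[col] = (row[col] == null) ? c.text : row[col] + " " + c.text;
        }
        return new ArrayList<>(rows.values());
    }

    /** Index of the last column whose left edge is <= x (0 if x lies left of all edges). */
    static int columnIndex(float x, float[] columnStarts) {
        int col = 0;
        for (int i = 0; i < columnStarts.length; i++) {
            if (x >= columnStarts[i]) {
                col = i;
            }
        }
        return col;
    }

    public static void main(String[] args) {
        // Made-up coordinates for three columns: row number, MD, TD.
        float[] columnStarts = {50f, 150f, 250f};
        List<Chunk> chunks = Arrays.asList(
                new Chunk("1", 52f, 700f),      // MD and TD both empty
                new Chunk("2", 52f, 680f),
                new Chunk("12.5", 255f, 680f),  // MD empty, TD filled
                new Chunk("3", 52f, 660f),
                new Chunk("7.1", 153f, 660f));  // MD filled, TD empty
        for (String[] row : toRows(chunks, columnStarts)) {
            System.out.println(Arrays.toString(row));
        }
    }
}
```

With the made-up coordinates in `main`, the rows print as `[1, null, null]`, `[2, null, 12.5]` and `[3, 7.1, null]`, so a value in TD is not mistaken for one in MD. Rounding y to the nearest integer is a simplification: text on the same visual line with slightly different baselines would need a tolerance instead.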

0 Answers