I want to get all the lines from a PDF book that contain italicized words.

Example (one line in the PDF):

this is dog.

I want the output to be every line that contains italic text. In this case, the whole line would be the output.

Can I get that from the PDF with a parser in Java or Python? Or is there somewhere I can get such a list of lines?

sam
  • This is a very broad question. You should break it down into questions about how to fetch, how to parse the result, and how to define 'italic lines' and how to filter them. And pick a language. – Joe Sep 07 '13 at 17:14
  • Do you actually see the asterisks? If yes: just go through the document and take everything between asterisks. – Jeroen Vannevel Sep 07 '13 at 17:14
  • No, I don't see asterisks. They were here by mistake. – sam Sep 07 '13 at 17:24
  • @Joe: Yes, I know that, but I thought maybe someone already has such a list of sentences; otherwise I have to parse the PDF, and I don't know the next step. – sam Sep 07 '13 at 17:25
  • This may give you a good start: http://wwwimages.adobe.com/www.adobe.com/content/dam/Adobe/en/devnet/pdf/pdfs/PDF32000_2008.pdf – s.bandara Sep 07 '13 at 17:55
  • Better you should tell whoever not to post copyrighted e books on the public Internet. – Ernest Friedman-Hill Sep 07 '13 at 21:26
  • @ErnestFriedman-Hill : yes done. sorry that was by mistake. – sam Sep 08 '13 at 02:15

1 Answer

After a bit of playing around with iText® here is what I came up with:

Most italic fonts are identified by a negative italic angle (the ItalicAngle entry of the font descriptor), as stated here.
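
As a rough illustration of that rule, the check can be isolated into a small helper. This is only a sketch, not iText API: the class ItalicHeuristic and the method looksItalic are hypothetical names, and falling back to the font name when the angle is zero is an assumption (some PDFs may carry a zero or missing ItalicAngle for slanted fonts). Note that it accepts either signal, whereas the strategy in this answer is stricter and requires both.

```java
import java.util.Locale;

/** Hypothetical helper (not part of iText): guesses whether a font is italic. */
final class ItalicHeuristic {

    private ItalicHeuristic() {}

    /**
     * @param fontName    font or family name, e.g. "Times-Italic" (may be null)
     * @param italicAngle ItalicAngle from the font descriptor, in degrees
     * @return true if the angle is negative or the name hints at a slanted face
     */
    static boolean looksItalic(String fontName, float italicAngle) {
        // A negative angle means the glyphs slope to the right.
        if (italicAngle < 0f) {
            return true;
        }
        if (fontName == null) {
            return false;
        }
        // Assumption: fall back to the name when the angle gives no signal.
        String name = fontName.toLowerCase(Locale.ROOT);
        return name.contains("italic") || name.contains("oblique");
    }

    public static void main(String[] args) {
        System.out.println(looksItalic("Times-Italic", -15.5f));  // true
        System.out.println(looksItalic("Helvetica", 0f));         // false
        System.out.println(looksItalic("Helvetica-Oblique", 0f)); // true
    }
}
```

With iText, the angle passed in would come from font.getFontDescriptor(BaseFont.ITALICANGLE, 0), as in the code that follows.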

Based on this example for SimpleTextExtractionStrategy, I wrote a custom RenderListener named ItalicTextExtractionStrategy. It extends SimpleTextExtractionStrategy and overrides the renderText method so that only italic text chunks are passed on:

import java.io.File;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.PrintWriter;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Map.Entry;
import java.util.Set;

import com.itextpdf.text.pdf.BaseFont;
import com.itextpdf.text.pdf.DocumentFont;
import com.itextpdf.text.pdf.PdfReader;
import com.itextpdf.text.pdf.parser.PdfTextExtractor;
import com.itextpdf.text.pdf.parser.SimpleTextExtractionStrategy;
import com.itextpdf.text.pdf.parser.TextRenderInfo;

public class ExtractItalicText {

    final static class ItalicTextExtractionStrategy extends SimpleTextExtractionStrategy {
        @Override
        public void renderText(TextRenderInfo renderInfo) {
            DocumentFont font = renderInfo.getFont();
            String[][] familyFontNamesArray = font.getFamilyFontName();
            for(String[] familyFontNames : familyFontNamesArray) {
                for(String familyFontName : familyFontNames) {
                    if(familyFontName.toLowerCase().contains("italic")) {
                        // the second argument (font size) is not relevant for ItalicAngle, otherwise 1000 is a good value
                        // source: http://grepcode.com/file/repo1.maven.org/maven2/com.itextpdf/itextpdf/5.4.2/com/itextpdf/text/pdf/DocumentFont.java#DocumentFont
                        float italicAngle = font.getFontDescriptor(BaseFont.ITALICANGLE, 0);
                        if(italicAngle < 0) {
                            super.renderText(renderInfo);
                        }
                        // stop at the first matching name so the same chunk is never rendered twice
                        return;
                    }
                }
            }
        }
    }

    public static void extractItalicText(String pdf) throws IOException {
        PdfReader reader = null;
        PrintWriter out = null;
        PrintWriter outItalic = null;
        long s = System.currentTimeMillis();
        try {
            System.out.println("Processing: " + pdf + " ...");
            // output for original text including italic styled
            out = new PrintWriter(new FileOutputStream("src/main/resources/" + new File(pdf).getName() + ".txt"));
            // output for italic styled text only
            outItalic = new PrintWriter(new FileOutputStream("src/main/resources/" + new File(pdf).getName() + "_italic.txt"));
            reader = new PdfReader(pdf);
            int numberOfPages = reader.getNumberOfPages();
            for(int pageNumber = 1; pageNumber <= numberOfPages; pageNumber++) {
                // extract italic text
                String pageItalicText = PdfTextExtractor.getTextFromPage(reader, pageNumber, new ItalicTextExtractionStrategy());
                if(pageItalicText.trim().length() > 0) {
                    // we have some italic text in the current page, so we get the whole text of the page
                    // to search for the lines where the italic text is located
                    String textFromPage = PdfTextExtractor.getTextFromPage(reader, pageNumber);
                    String[] textLinesFromPage = textFromPage.split("[\r\n]");

                    // punctuation marks etc. are sometimes not part of the italic text, so we need to clean the line
                    // map a cleaned line to a raw line
                    Map<String, String> cleanedtextLines = new LinkedHashMap<String, String>(textLinesFromPage.length * 4 / 3 + 1);
                    for(String line : textLinesFromPage) {
                        out.println(line);
                        // clean line from all non-word characters
                        cleanedtextLines.put(line.replaceAll("\\W", ""), line);
                    }
                    // split the italic text into lines
                    String[] italicTextLines = pageItalicText.split("[\r\n]");
                    Set<String> linesContainingItalicText = new HashSet<String>(italicTextLines.length * 4 / 3 + 1);
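                    for(String italicText : italicTextLines) {
                        // clean the italic text from non-word characters
                        String cleanedItalicText = italicText.replaceAll("\\W", "");
                        // skip empty fragments: contains("") is always true, so they would match every line
                        if(cleanedItalicText.isEmpty()) {
                            continue;
                        }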
                        // search for the corresponding line
                        for(Entry<String, String> lineEntry : cleanedtextLines.entrySet()) {                            
                            if((! linesContainingItalicText.contains(lineEntry.getKey())) 
                                    && lineEntry.getKey().contains(cleanedItalicText)) {
                                linesContainingItalicText.add(lineEntry.getKey());
                                // output the raw line
                                outItalic.println(lineEntry.getValue());
                            }
                        }
                    }
                }
                out.println("==== Page " + pageNumber + " =========================================================\n");
                outItalic.println("==== Page " + pageNumber + " =========================================================\n");
            }

        } finally {
            if(out != null) {
                out.close();
            }
            if(outItalic != null) {
                outItalic.close();
            }
            if(reader != null) {
                reader.close(); 
            }
            long e = System.currentTimeMillis();
            System.out.println("done (" + (e-s) + " ms)");
        }
    }

    /**
     * @param args
     * @throws IOException 
     */
    public static void main(String[] args) throws IOException {
        for(String arg: args) {
            extractItalicText(arg);
        }
    }
}

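The resulting "_italic.txt" output file contains the lines that include italic text in the original PDF document. The plain ".txt" file contains the full text of each page on which italic text was found.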
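
The line-matching step inside extractItalicText can also be looked at on its own, without iText. The following is a minimal sketch under stated assumptions: ItalicLineMatcher, normalize and linesContaining are hypothetical names, the page lines and italic fragments are assumed to have been extracted already, and results are collected per raw line rather than per cleaned line as in the code above.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

/** Hypothetical helper: finds the page lines that contain an italic fragment. */
final class ItalicLineMatcher {

    private ItalicLineMatcher() {}

    /** Removes every non-word character (spaces, punctuation) before comparing. */
    static String normalize(String text) {
        return text.replaceAll("\\W", "");
    }

    /**
     * @param pageLines       all lines of one page, in reading order
     * @param italicFragments text chunks that were rendered with an italic font
     * @return each raw line containing at least one fragment, without duplicates
     */
    static List<String> linesContaining(List<String> pageLines, List<String> italicFragments) {
        Set<String> hits = new LinkedHashSet<>();
        for (String fragment : italicFragments) {
            String cleanedFragment = normalize(fragment);
            if (cleanedFragment.isEmpty()) {
                continue; // an empty fragment would match every line
            }
            for (String line : pageLines) {
                if (normalize(line).contains(cleanedFragment)) {
                    hits.add(line);
                }
            }
        }
        return new ArrayList<>(hits);
    }

    public static void main(String[] args) {
        List<String> lines = List.of("This is a dog.", "The cat sat.", "A dog, again!");
        List<String> italic = List.of("dog", "  ");
        for (String hit : linesContaining(lines, italic)) {
            System.out.println(hit);
        }
    }
}
```

Because the comparison is a plain substring test on cleaned text, very short italic fragments (a single letter, for example) will match many unrelated lines, so the result should be treated as a candidate list rather than an exact answer.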

I hope this will help you solve your problem!

A4L
  • There are errors when running the file. Also, I do not want only the italic text in the output; I want the whole statements that contain italic strings. – sam Sep 08 '13 at 02:11
  • @sam Which errors? It compiles and runs fine over here with iText version `5.1.3`. As for extracting the whole line, see the edited method `extractItalicText`; explanations are in the comments. I hope this will help you. If you have other requirements, you'll have to look closely at the `iText` docs and APIs. – A4L Sep 08 '13 at 11:13
  • There is a "reached end of file while parsing" error, and when I add } at the end it gives me 27 errors. Can you please check whether your code is only half written here? – sam Sep 10 '13 at 13:40
  • @sam Fixed; there was only a missing curly bracket at the end and a typo in `extractItalicText` (s instead of x in Text). If you still have errors, then you might not have the `iText` library on your classpath while compiling. In that case you have to download it (see the link I pasted at the top of the post) and put it on your classpath both while compiling and while running the program. iText is not part of the standard Java JDK! – A4L Sep 10 '13 at 18:01
  • @sam Besides that, do you know Java? Actually, you should have been able to fix that yourself; if not, I would suggest you look for a solution with Python. The code I provided is just meant to give you some hints about how to approach this and is not supposed to be error free or to deliver the expected results for every PDF file. The PDF format is pretty tricky. Good luck! – A4L Sep 10 '13 at 18:01