I have this code that extracts text from scientific papers (PDF) using PDFBox 2.0.

import java.io.File;
import java.io.FileInputStream;
import java.io.FileNotFoundException;
import java.io.IOException;
import java.util.ArrayList;
import java.util.HashMap;

public class Sectioning {

    private static ArrayList<String> titles = new ArrayList<String>();
    private static ArrayList<Integer> sectionsIndex = new ArrayList<Integer>();
    private static HashMap<String, String> Sections = new HashMap<String, String>();
    private static PDFManager pdfManager = new PDFManager();

    public Sectioning() {
    }

    // Takes the PDF file and sends it to extractText (in the PDFSectionsTitle class)
    // to get the section titles found in the PDF file.
    public ArrayList<String> GetTitles(File file) throws FileNotFoundException
    {
        FileInputStream fis = new FileInputStream(file);
        titles = extractText(fis);

        return titles;
    }

    /* Takes the PDF file, gets its text, then finds the index of each title in the text.
       The text is then passed to TextSections, which stores each title with its section in a HashMap. */
    public HashMap<String, String> Section(File file) throws IOException
    {
        pdfManager.setFilePath(file.getPath());

        String text = pdfManager.toText();
        int prevstop = 0;

        for (int j = 0; j <= titles.size() - 1; j++)
        {
            prevstop = text.indexOf(titles.get(j), prevstop);
            sectionsIndex.add(prevstop);
        }

        TextSections(text);

        return Sections;
    }

    // Stores each title with its paragraphs in a HashMap.
    public void TextSections(String text)
    {
        for (int i = 0; i <= sectionsIndex.size() - 1; i++)
        {
            if (i == sectionsIndex.size() - 1)
            {
                // For the last title, the section runs to the end of the text.
                Sections.put(titles.get(i), text.substring(sectionsIndex.get(i)).replaceFirst(titles.get(i), ""));
            }
            else
            {
                // The section of the current title ends where the next title starts.
                Sections.put(titles.get(i), text.substring(sectionsIndex.get(i), sectionsIndex.get(i + 1)).replaceFirst(titles.get(i), ""));
            }
        }
    }

    public void clear() throws IOException {
        titles.clear();
        sectionsIndex.clear();
        Sections.clear();
        pdfManager.closeDoc();
    }
}

When I execute the code it gives excellent results, but with SOME papers it throws the exception below:

 Jul 22, 2020 12:03:01 AM org.apache.pdfbox.pdmodel.font.PDSimpleFont toUnicode
WARNING: No Unicode mapping for summationdisplay (88) in font UKPOAO+CMEX10
Jul 22, 2020 12:03:03 AM org.apache.pdfbox.pdmodel.font.PDSimpleFont toUnicode
WARNING: No Unicode mapping for summationdisplay (88) in font UKPOAO+CMEX10
Exception in thread "main" java.lang.StringIndexOutOfBoundsException: String index out of range: -6237
    at java.lang.String.substring(String.java:1911)
    at pdfpapersections.Sectioning.TextSections(Sectioning.java:61)
    at pdfpapersections.Sectioning.Section(Sectioning.java:45)
    at pdfpapersections.PDFPaperSections.main(PDFPaperSections.java:46)
Java Result: 1

Does anybody have any idea why it's giving me this error? The files are NOT corrupted, and I have extracted their text using other methods!

Emma
  • The exception is in your code, not PDFBox. Take a close look at `Sectioning.java:61`, or show us the class. – Petr Janeček Jul 21 '20 at 21:22
  • "No Unicode mapping for summationdisplay" means that these glyphs don't have a Unicode value to extract, but that has nothing to do with your exception. – Tilman Hausherr Jul 22 '20 at 08:39
  • @PetrJaneček Thank you for your comment, I've added the class. – Emma Jul 23 '20 at 20:28
  • @TilmanHausherr Thank you for your comment, but I'm sure that the glyphs are fine, because I did extract the text in them and it worked fine! – Emma Jul 23 '20 at 20:30
  • Re unicode, I would have to see the PDF. Probably most of the text extracts, but not the "summationdisplay" glyph. Re that exception: find out why "sectionsIndex.get(i)" is "-6237". Maybe "sectionsIndex" is filled at more places than "sectionsIndex.add(prevstop);" where IMHO it can only be >= 0. – Tilman Hausherr Jul 24 '20 at 12:41
  • @TilmanHausherr Thank you for your reply. How do you reunicode a PDF file? I have a dataset containing hundreds of scientific papers in PDF, and I don't think it can be edited! – Emma Aug 13 '20 at 12:33
  • Difficult https://stackoverflow.com/questions/39485920/ – Tilman Hausherr Aug 13 '20 at 13:06
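
One possible reading of the trace, offered as a sketch rather than a confirmed diagnosis: `String.indexOf` returns -1 when the title is not found in the text, so `sectionsIndex` can end up holding -1. On older JDKs such as Java 7 and 8, `substring(begin, end)` with a valid `begin` and `end = -1` reports `end - begin` in the exception message, so a title found at index 6236 followed by a title that is missing would produce exactly -6237. Since the titles come from `extractText` and the text from `pdfManager.toText()`, the two extractions may not always agree character for character. The class `SectionSplitSketch` and its `split` method below are hypothetical names, not part of the code in the question. The sketch skips titles that are not found and cuts the title off by length instead of using `replaceFirst`, which interprets the title as a regular expression.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper, not part of the original Sectioning class.
class SectionSplitSketch {

    // Splits text into sections keyed by title. Titles that do not occur
    // in the text (indexOf returns -1) are skipped instead of being stored.
    static Map<String, String> split(String text, List<String> titles) {
        List<String> found = new ArrayList<>();
        List<Integer> starts = new ArrayList<>();
        int from = 0;
        for (String title : titles) {
            int at = text.indexOf(title, from);
            if (at < 0) {
                continue; // title missing from the extracted text
            }
            found.add(title);
            starts.add(at);
            from = at + title.length(); // keep searching after this title
        }

        Map<String, String> sections = new LinkedHashMap<>();
        for (int i = 0; i < found.size(); i++) {
            // The body starts right after the title, no regex involved.
            int bodyStart = starts.get(i) + found.get(i).length();
            // The body ends at the next found title, or at the end of the text.
            int end = (i + 1 < found.size()) ? starts.get(i + 1) : text.length();
            sections.put(found.get(i), text.substring(bodyStart, end));
        }
        return sections;
    }

    public static void main(String[] args) {
        String text = "Abstract aaa Introduction bbb Conclusion ccc";

        // Reproduces the failing pattern: a found index followed by -1.
        int begin = text.indexOf("Introduction");
        int end = text.indexOf("Related Work", begin); // -1, not in the text
        try {
            text.substring(begin, end);
        } catch (StringIndexOutOfBoundsException e) {
            // The exact message depends on the JDK version.
            System.out.println("substring failed: " + e.getMessage());
        }

        // The guarded version skips the missing title.
        List<String> titles = Arrays.asList("Abstract", "Related Work", "Conclusion");
        System.out.println(split(text, titles));
    }
}
```

Logging the skipped titles instead of silently dropping them would show which papers have titles that do not match the extracted text, which may be a quicker way to find the problematic PDFs in a large dataset.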
