I have this code that extracts text from a scientific paper (PDF) using PDFBox 2.0:
import java.io.File;
import java.io.FileInputStream;
import java.io.FileNotFoundException;
import java.io.IOException;
import java.util.ArrayList;
import java.util.HashMap;

public class Sectioning {
    private static ArrayList<String> titles = new ArrayList<String>();
    private static ArrayList<Integer> sectionsIndex = new ArrayList<Integer>();
    private static HashMap<String, String> Sections = new HashMap<String, String>();
    private static PDFManager pdfManager = new PDFManager();

    public Sectioning() {
    }

    // Takes the PDF file and sends it to extractText (in the PDFSectionsTitle class) to get the titles in the PDF file
    public ArrayList<String> GetTitles(File file) throws FileNotFoundException {
        FileInputStream fis = new FileInputStream(file);
        titles = extractText(fis);
        return titles;
    }

    /* Takes the PDF file, gets its text, then finds the index of each title in the text,
       then sends the text to TextSections to store the titles and their sections in a hashmap */
    public HashMap<String, String> Section(File file) throws IOException {
        pdfManager.setFilePath(file.getPath());
        String text = pdfManager.toText();
        int prevstop = 0;
        for (int j = 0; j < titles.size(); j++) {
            prevstop = text.indexOf(titles.get(j), prevstop);
            sectionsIndex.add(prevstop);
        }
        TextSections(text);
        return Sections;
    }

    // Stores the titles with their paragraphs in a hashmap
    public void TextSections(String text) {
        for (int i = 0; i < sectionsIndex.size(); i++) {
            if (i == sectionsIndex.size() - 1) {
                // For the last title, the paragraph runs to the end of the file
                Sections.put(titles.get(i), text.substring(sectionsIndex.get(i)).replaceFirst(titles.get(i), ""));
            } else {
                // The paragraphs of the current title end where the next title starts
                Sections.put(titles.get(i), text.substring(sectionsIndex.get(i), sectionsIndex.get(i + 1)).replaceFirst(titles.get(i), ""));
            }
        }
    }

    public void clear() throws IOException {
        titles.clear();
        sectionsIndex.clear();
        Sections.clear();
        pdfManager.closeDoc();
    }
}
When I execute the code it gives excellent results, but with SOME papers it throws the exception below:
Jul 22, 2020 12:03:01 AM org.apache.pdfbox.pdmodel.font.PDSimpleFont toUnicode
WARNING: No Unicode mapping for summationdisplay (88) in font UKPOAO+CMEX10
Jul 22, 2020 12:03:03 AM org.apache.pdfbox.pdmodel.font.PDSimpleFont toUnicode
WARNING: No Unicode mapping for summationdisplay (88) in font UKPOAO+CMEX10
Exception in thread "main" java.lang.StringIndexOutOfBoundsException: String index out of range: -6237
at java.lang.String.substring(String.java:1911)
at pdfpapersections.Sectioning.TextSections(Sectioning.java:61)
at pdfpapersections.Sectioning.Section(Sectioning.java:45)
at pdfpapersections.PDFPaperSections.main(PDFPaperSections.java:46)
Java Result: 1
Does anybody have any idea why it is giving me this error? The files are NOT corrupted, and I have extracted their text using other methods!
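One thing worth checking, as a guess rather than a confirmed diagnosis: `String.indexOf` returns -1 when a title is not found in the extracted text, and `substring(begin, end)` throws `StringIndexOutOfBoundsException` when `end` is smaller than `begin`. A single title that does not match the extracted text exactly (for example because of different spacing or an unmapped glyph) would therefore be enough to produce a negative range. The sketch below is a hypothetical standalone helper, not part of the class above; the names `SectionSplitSketch` and `split` are made up for illustration. It skips titles that are not found and cuts the title off by length instead of using `replaceFirst`, which interprets the title as a regular expression.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class SectionSplitSketch {

    // Hypothetical helper: maps each title to the text between it and the next
    // found title. Titles that indexOf cannot locate are reported and skipped.
    static Map<String, String> split(String text, List<String> titles) {
        List<String> found = new ArrayList<>();
        List<Integer> starts = new ArrayList<>();
        int from = 0;
        for (String title : titles) {
            int at = text.indexOf(title, from);
            if (at < 0) {
                // indexOf returns -1 when the title is absent from the text
                System.err.println("Title not found in extracted text: " + title);
                continue;
            }
            found.add(title);
            starts.add(at);
            from = at + title.length(); // keep searching after this title
        }

        Map<String, String> sections = new LinkedHashMap<>();
        for (int i = 0; i < found.size(); i++) {
            // Skip the title itself by length, so no regex is involved
            int bodyStart = starts.get(i) + found.get(i).length();
            int bodyEnd = (i + 1 < found.size()) ? starts.get(i + 1) : text.length();
            sections.put(found.get(i), text.substring(bodyStart, bodyEnd));
        }
        return sections;
    }

    public static void main(String[] args) {
        String text = "Abstract aaa Introduction bbb Results ccc";
        // "Rel. Work" does not occur in the text, so it is skipped
        List<String> titles = Arrays.asList("Abstract", "Introduction", "Rel. Work", "Results");
        System.out.println(split(text, titles));
    }
}
```

Logging the skipped titles should show whether the failing papers are the ones where a title from `extractText` does not appear verbatim in the output of `pdfManager.toText()`.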