I am trying to extract all the text from a PDF and store it in a HashSet. As far as I know, a HashSet cannot contain duplicates, so it should ignore repeated words as I add them. However, when I print the contents of the set, I notice what looks like duplicate blank space in it.
I want to insert the values from the set into a MySQL table, but the column has a primary key constraint, and that is causing trouble. Is there a way to remove every kind of duplicate from my set?
My code to extract the text:
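    import java.io.File;
    import java.io.IOException;
    import java.util.HashSet;
    import org.apache.pdfbox.pdmodel.PDDocument;
    import org.apache.pdfbox.text.PDFTextStripper;

    public class PdfWordExtractor { // placeholder name; the enclosing class was not shown
        public static void main(String[] args) throws Exception {
            String path = "D:/PDF/searchable.pdf";
            HashSet<String> uniqueWords = new HashSet<>();
            try (PDDocument document = PDDocument.load(new File(path))) {
                if (!document.isEncrypted()) {
                    PDFTextStripper tStripper = new PDFTextStripper();
                    String pdfFileInText = tStripper.getText(document);
                    // Split the text into lines, then each line into words on single spaces.
                    String[] lines = pdfFileInText.split("\\r?\\n");
                    for (String line : lines) {
                        String[] words = line.split(" ");
                        for (String word : words) {
                            uniqueWords.add(word);
                        }
                    }
                    System.out.println(uniqueWords);
                }
            } catch (IOException e) {
                System.err.println("Exception while trying to read pdf document - " + e);
            }
            Object[] words = uniqueWords.toArray();
            System.out.println(words[1].toString());
            // MysqlAccess is my own helper class (not shown); readDataBase does the insert.
            MysqlAccess connection = new MysqlAccess();
            // Note: the loop starts at index 1, so element 0 of the array is skipped.
            for (int i = 1; i < words.length; i++) {
                connection.readDataBase(path, words[i].toString());
            }
            System.out.println("Completed");
        }
    }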
This is the printed content of my HashSet:
[, highlight, of, Even, copy, file,, or, ., ,, 1, reader,, different, D, F, ll, link, ea, This, ed, document, V, P, ability, regardless, g, d, text., e, b, a, n, o, web, l, footnote., should, Most, IDRH, selection, text-searchable, positioning, u, s, what, r, PDF., happens, er, y, x, to, body, single, ca, te, together, ti, th, would, when, be, Text-Searchable, document,, text, isn't, such, kinds, sh, co, ld, font,, example, ch, this, attempt, have, t,, Notice,, contained, from, re, text.1, page,, style, page., able, if, is, You, standard, PDF, your, as, readers, you, the, in, main, an, iz]
If the entries are unique, why does it throw "Duplicate entry for key PRIMARY" when I try to insert them into a primary key column?
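For reference, here is a minimal sketch of one normalization that might avoid the collision. It rests on two assumptions that are not confirmed above: that the primary key column uses one of MySQL's case-insensitive collations (so `D` and `d`, or `This` and `this`, would count as the same key even though the HashSet treats them as different strings), and that words differing only in attached punctuation should be treated as one word. The class and method names (`WordNormalizer`, `uniqueWords`) are hypothetical and only for illustration.

```java
import java.util.LinkedHashSet;
import java.util.Locale;
import java.util.Set;

public class WordNormalizer {

    // Hypothetical helper: reduces raw text to a set of cleaned, lower-cased words.
    static Set<String> uniqueWords(String text) {
        Set<String> result = new LinkedHashSet<>();
        // Split on any run of whitespace, so consecutive spaces do not produce empty tokens.
        for (String token : text.split("\\s+")) {
            // Strip leading and trailing ASCII punctuation, e.g. "file," or "text.",
            // then lower-case so "This" and "this" collapse to a single entry.
            String word = token.replaceAll("^\\p{Punct}+|\\p{Punct}+$", "")
                               .toLowerCase(Locale.ROOT);
            // Skip tokens that were empty or consisted only of punctuation.
            if (!word.isEmpty()) {
                result.add(word);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        String sample = "This  PDF. this document, Text-Searchable text-searchable";
        System.out.println(uniqueWords(sample));
    }
}
```

The set printed above holds both `Text-Searchable` and `text-searchable`, as well as `This` and `this`; after this kind of normalization only one of each pair would remain. Whether that is the actual cause depends on the table's collation, which is not shown here.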
Any suggestion would be appreciated.