Basically with SimpleGraph you'd use a combination of the modules
- FileSystem
- PDFSystem
With the FileSystem module you collect your graph of files in the directory and filter it to include only files with the extension pdf - then you analyze the PDFs using the PDFSystem to get the page/text structure - there is already a test case for this in the simplegraph-bundle module showing how it works with some RFC pdfs as input.
TestPDFFiles.java
I have now added the indexing test see below.
The core functionality has been taken from the old test with searching for a single keyword and allowing this as a parameter:
List<Object> founds = pdfSystem.g().V().hasLabel("page")
.has("text", RegexPredicate.regex(".*" + keyWord + ".*")).in("pages")
.dedup().values("name").toList();
This is a gremlin query that will do most of the work by searching in a whole tree of PDF files with just one call. I consider this more elegant since you do not have to care about the structure of the input (tree/graph/filesystem/database, etc ...)
JUnit Testcase
@Test
/**
* test for https://github.com/BITPlan/com.bitplan.simplegraph/issues/12
*/
public void testPDFIndexing() throws Exception {
FileSystem fs = getFileSystem(RFC_DIRECTORY);
int limit = Integer.MAX_VALUE;
PdfSystem pdfSystem = getPdfSystemForFileSystem(fs, limit);
Map<String, List<String>> index = this.getIndex(pdfSystem, "ARPA",
"proposal", "plan");
// debug=true;
if (debug) {
for (Entry<String, List<String>> indexEntry : index.entrySet()) {
List<String> fileNameList = indexEntry.getValue();
System.out.println(String.format("%15s=%3d %s", indexEntry.getKey(),
fileNameList.size(), fileNameList));
}
}
assertEquals(14,index.get("ARPA").size());
assertEquals(9,index.get("plan").size());
assertEquals(8,index.get("proposal").size());
}