Traditional pdf indexing solution compared to graph-based version

Question

I want to index an arbitrary directory containing PDF files (among other file types) against a list of keywords. I have a traditional solution, and I have heard that graph-based solutions using e.g. SimpleGraph could be more elegant, more efficient, and independent of directory structures.

What would a graph-based solution (e.g. SimpleGraph) look like?

Traditional solution

// https://stackoverflow.com/a/14051951/1497139
List<File> pdfFiles = this.explorePath(TestPDFFiles.RFC_DIRECTORY, "pdf");
List<PDFFile> pdfs = this.getPdfsFromFileList(pdfFiles);
…
for (PDFFile pdf : pdfs) {
  // https://stackoverflow.com/a/9560307/1497139
  if (org.apache.commons.lang3.StringUtils.containsIgnoreCase(pdf.getText(), keyWord)) {
    // here we access by structure (early binding);
    // in the graph solution we would access by name (late binding)
    foundList.add(pdf.file.getName());
  }
}

Could you please add this as an issue to the simplegraph project? — Wolfgang Fahl, Aug 11 '18 at 13:24
https://github.com/BITPlan/com.bitplan.simplegraph/issues/16 — pdvsofismo, Aug 11 '18 at 13:33

Wolfgang Fahl · Accepted Answer · 2018-08-11T20:04:58.553

Basically, with SimpleGraph you would use a combination of these modules:

FileSystem
PDFSystem

With the FileSystem module you collect a graph of the files in the directory and filter it to include only files with the extension pdf. You then analyze the PDFs with the PDFSystem module to get the page/text structure. There is already a test case for this in the simplegraph-bundle module that shows how it works with some RFC PDFs as input:

TestPDFFiles.java
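For comparison, the file-collection step on its own (walk a directory tree and keep only files with a given extension) can be sketched in plain Java with java.nio.file. This is not the SimpleGraph FileSystem module and does not use its API; the class name PdfFileCollector and the method collectByExtension are made up for this illustration.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.List;
import java.util.Locale;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Hypothetical helper, not part of SimpleGraph: plain java.nio stand-in
// for "collect the files and filter by extension".
public class PdfFileCollector {

  /** Recursively collects all regular files below root that end in ".extension" (case-insensitive). */
  public static List<Path> collectByExtension(Path root, String extension) throws IOException {
    String suffix = "." + extension.toLowerCase(Locale.ROOT);
    // Files.walk keeps directory handles open, so close the stream when done
    try (Stream<Path> paths = Files.walk(root)) {
      return paths
          .filter(Files::isRegularFile)
          .filter(p -> p.getFileName().toString().toLowerCase(Locale.ROOT).endsWith(suffix))
          .sorted()
          .collect(Collectors.toList());
    }
  }

  public static void main(String[] args) throws IOException {
    // the directory to scan is taken from the command line, defaulting to the current directory
    Path root = Paths.get(args.length > 0 ? args[0] : ".");
    collectByExtension(root, "pdf").forEach(System.out::println);
  }
}
```

The graph-based variant replaces this walk with a traversal over file nodes, so the later steps no longer depend on how the files were found.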

I have now added the indexing test; see below.

The core functionality is taken from the old test, which searched for a single keyword; the keyword is now passed in as a parameter:

List<Object> founds = pdfSystem.g().V().hasLabel("page")
      .has("text", RegexPredicate.regex(".*" + keyWord + ".*")).in("pages")
      .dedup().values("name").toList();

This is a Gremlin query that does most of the work: it searches a whole tree of PDF files with just one call. I consider this more elegant since you do not have to care about the structure of the input (tree, graph, filesystem, database, etc.).
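The test case in this answer calls a getIndex helper whose body is not shown here. A minimal sketch of how such a helper could loop over the keywords and run one lookup per keyword might look like the following. It is an assumption, not the actual SimpleGraph code: the lookup is passed in as a function so the sketch runs without TinkerPop, and filesContaining is an in-memory stand-in for the Gremlin query (file name mapped to page texts). The class name, method signatures, and sample file names are all made up for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Function;
import java.util.regex.Pattern;

// Hypothetical sketch, not the SimpleGraph implementation.
public class KeywordIndexSketch {

  /** Builds keyword -> file names by running one lookup per keyword, in the order given. */
  public static Map<String, List<String>> getIndex(
      Function<String, List<String>> lookup, String... keyWords) {
    Map<String, List<String>> index = new LinkedHashMap<>();
    for (String keyWord : keyWords) {
      // in the graph version this call would be the single Gremlin query
      index.put(keyWord, lookup.apply(keyWord));
    }
    return index;
  }

  /** In-memory stand-in for the page query: returns each file that has a page containing keyWord. */
  public static List<String> filesContaining(
      Map<String, List<String>> pagesByFile, String keyWord) {
    // assumption: quote the keyword and let "." match newlines; matching is case-sensitive
    Pattern pattern = Pattern.compile(".*" + Pattern.quote(keyWord) + ".*", Pattern.DOTALL);
    List<String> found = new ArrayList<>();
    for (Map.Entry<String, List<String>> entry : pagesByFile.entrySet()) {
      boolean anyPageMatches =
          entry.getValue().stream().anyMatch(text -> pattern.matcher(text).matches());
      if (anyPageMatches) {
        found.add(entry.getKey()); // each file is added at most once
      }
    }
    return found;
  }

  public static void main(String[] args) {
    // made-up sample data: file name -> text of each page
    Map<String, List<String>> pagesByFile = new LinkedHashMap<>();
    pagesByFile.put("rfc1.pdf", Arrays.asList("ARPA network plan", "appendix"));
    pagesByFile.put("rfc2.pdf", Arrays.asList("a proposal", "ARPA"));
    pagesByFile.put("rfc3.pdf", Arrays.asList("nothing relevant"));

    Map<String, List<String>> index =
        getIndex(k -> filesContaining(pagesByFile, k), "ARPA", "proposal", "plan");
    // prints {ARPA=[rfc1.pdf, rfc2.pdf], proposal=[rfc2.pdf], plan=[rfc1.pdf]}
    System.out.println(index);
  }
}
```

Unlike the containsIgnoreCase call in the traditional solution, this regex match is case-sensitive, so "arpa" would not match "ARPA".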

JUnit Testcase

  /**
   * test for https://github.com/BITPlan/com.bitplan.simplegraph/issues/12
   */
  @Test
  public void testPDFIndexing() throws Exception {
    FileSystem fs = getFileSystem(RFC_DIRECTORY);
    int limit = Integer.MAX_VALUE;
    PdfSystem pdfSystem = getPdfSystemForFileSystem(fs, limit);
    Map<String, List<String>> index = this.getIndex(pdfSystem, "ARPA",
        "proposal", "plan");
    // debug=true;
    if (debug) {
      for (Entry<String, List<String>> indexEntry : index.entrySet()) {
        List<String> fileNameList = indexEntry.getValue();
        System.out.println(String.format("%15s=%3d %s", indexEntry.getKey(),
            fileNameList.size(), fileNameList));
      }
    }
    assertEquals(14,index.get("ARPA").size());
    assertEquals(9,index.get("plan").size());
    assertEquals(8,index.get("proposal").size());
  }

P.S. I have used the https://stackoverflow.com/questions/50314987/gremlin-text-comparison-predicates RegexPredicate — Wolfgang Fahl, Aug 11 '18 at 13:53
