
I am trying to analyze a field and build a query in Elasticsearch that finds similar filenames based on a filename provided in the query.

For example, if I have the filename 'invoice_1234.pdf', I'd like to find filenames like 'invoice_3456.pdf', but not files like 'inv_123456.pdf' or 'bigCompany.invoice.1234.pdf'.
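To make the expected behavior concrete, here is a rough Python sketch of the kind of matching I have in mind. It is only an illustration of the desired result, not something Elasticsearch does out of the box: `filename_shape` is a hypothetical helper that lowercases the name and collapses every run of digits into `#`, so two filenames count as similar when they share the same shape.

```python
import re


def filename_shape(filename: str) -> str:
    """Hypothetical helper: lowercase the name and collapse each digit run into '#'."""
    return re.sub(r"\d+", "#", filename.lower())


query = "invoice_1234.pdf"
candidates = [
    "invoice_3456.pdf",
    "inv_123456.pdf",
    "bigCompany.invoice.1234.pdf",
]

# Keep only the candidates whose shape equals the shape of the query filename.
similar = [name for name in candidates if filename_shape(name) == filename_shape(query)]
print(similar)  # ['invoice_3456.pdf']
```

Exact shape equality is stricter than what I ultimately want (some fuzziness would be fine), but it captures the three cases above.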

I should probably expand upon this. My index stores the filename:

{
    "documents": {
        "mappings": {
            "properties": {
                "filename" : {
                    "type" : "text",
                    "fields" : {
                      "reverse" : {
                        "type" : "text",
                        "analyzer" : "filename_reverse"
                      }
                    }
                }
            }
        }
    }
}

In code, a filename is passed in as part of the query:

GET /documents/_search
{
    "query": {
        "bool": {
            "should": [
                {
                    "match": {
                        "filename": {
                          "query": "potentially_any.filename.pdf"
                        }
                    }
                } 
            ]
        }
    }
}
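For reference, this is roughly how the request body is assembled in code. It is a minimal sketch: `build_filename_query` is a hypothetical helper name, and sending the body to the cluster is left out.

```python
import json


def build_filename_query(filename: str) -> dict:
    """Hypothetical helper: wrap a filename in the bool/should/match query shown above."""
    return {
        "query": {
            "bool": {
                "should": [
                    {"match": {"filename": {"query": filename}}}
                ]
            }
        }
    }


# Any filename can be substituted here; the query structure stays the same.
body = build_filename_query("potentially_any.filename.pdf")
print(json.dumps(body, indent=4))
```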

What I am trying to figure out is how to define an analyzer/tokenizer for this field so that, given any filename in the query, the search returns other filenames like it.

I tried the approach described in Filename search with ElasticSearch. It works well for finding filenames when I query a known part of the filename, but what I actually want is to find other related filenames based on a whole filename that is passed in. When I tested that approach, my results were typically all or nothing: either every file matched (usually because they all end in 'pdf' or contain 'inv'), or only the exact filename from the query matched.

Am I missing something in my approach, or is this a capability beyond Elasticsearch?
