
Using an ingest pipeline, I want to iterate over a HashMap and remove underscores from all string values (where underscores exist), leaving underscores in the keys intact. Some values are arrays, which must in turn be iterated over to apply the same modification.

In the pipeline, I use a function to traverse and modify the values of a Collection view of the HashMap.

PUT /_ingest/pipeline/samples
{
    "description": "preprocessing of samples.json",
    "processors": [
        {
            "script": {
                "tag": "remove underscore from sample_tags values",
                "source": """
                    void findReplace(Collection collection) {
                    collection.forEach(element -> {
                        if (element instanceof String) {
                            element.replace('_',' ');
                        } else {
                            findReplace(element);
                        }
                        return true;
                        })
                    }

                    Collection samples = ctx.samples;
                    samples.forEach(sample -> { //sample.sample_tags is a HashMap
                        Collection sample_tags = sample.sample_tags.values();
                        findReplace(sample_tags);
                        return true;
                    })
                """
            }
        }
    ]
}

When I simulate the pipeline ingestion, I find the string values are not modified. Where am I going wrong?

POST /_ingest/pipeline/samples/_simulate
{
    "docs": [
        {
            "_index": "samples",
            "_id": "xUSU_3UB5CXFr25x7DcC",
            "_source": {
                "samples": [
                    {
                        "sample_tags": {
                            "Entry_A": [
                                "A_hyphentated-sample",
                                "sample1"
                            ],
                            "Entry_B": "A_multiple_underscore_example",
                            "Entry_C": [
                                        "sample2",
                                        "another_example_with_underscores"
                            ],
                            "Entry_E": "last_example"
                        }
                    }
                ]
            }
        }
    ]
}

Result:

{
  "docs" : [
    {
      "doc" : {
        "_index" : "samples",
        "_type" : "_doc",
        "_id" : "xUSU_3UB5CXFr25x7DcC",
        "_source" : {
          "samples" : [
            {
              "sample_tags" : {
                "Entry_E" : "last_example",
                "Entry_C" : [
                  "sample2",
                  "another_example_with_underscores"
                ],
                "Entry_B" : "A_multiple_underscore_example",
                "Entry_A" : [
                  "A_hyphentated-sample",
                  "sample1"
                ]
              }
            }
          ]
        },
        "_ingest" : {
          "timestamp" : "2020-12-01T17:29:52.3917165Z"
        }
      }
    }
  ]
}

Jonathan

2 Answers


Here is a modified version of your script that will work on the data you provided:

PUT /_ingest/pipeline/samples
{
  "description": "preprocessing of samples.json",
  "processors": [
    {
      "script": {
        "tag": "remove underscore from sample_tags values",
        "source": """
          String replaceString(String value) {
            return value.replace('_',' ');
          }
      
          void findReplace(Map map) {
            map.keySet().forEach(key -> {
              if (map[key] instanceof String) {
                  map[key] = replaceString(map[key]);
              } else {
                  map[key] = map[key].stream().map(this::replaceString).collect(Collectors.toList());
              }
            });
          }

          ctx.samples.forEach(sample -> {
              findReplace(sample.sample_tags);
              return true;
          });
          """
      }
    }
  ]
}

The result looks like this:

     {
      "samples" : [
        {
          "sample_tags" : {
            "Entry_E" : "last example",
            "Entry_C" : [
              "sample2",
              "another example with underscores"
            ],
            "Entry_B" : "A multiple underscore example",
            "Entry_A" : [
              "A hyphentated-sample",
              "sample1"
            ]
          }
        }
      ]
    }
Val
  • Nice, I did not know you can replace in-place. +1 – Joe - GMapsBook.com Dec 02 '20 at 11:28
  • @JoeSorocin note that the solution is not replacing "in place" per se, `map[key]` is still being reassigned for each key like in a normal assign statement. – Val Dec 02 '20 at 12:04
  • True but findReplace is still void, isnt it? – Joe - GMapsBook.com Dec 02 '20 at 14:12
  • @JoeSorocin yes, well I guess that depends on the definition of "in place" :-) To me in place means that you're iterating on the values themselves and change them as if you had a pointer to it, which is not possible in Java (thank god) – Val Dec 02 '20 at 14:13
  • I hear you ;) It's still hard to understand why changes to the `Map map` argument mirror in the original `ctx` but oh well. I'll stick to python I guess LOL – Joe - GMapsBook.com Dec 02 '20 at 19:39
  • Hi again, @val. I've tried the pipeline script you provided, and tested with the simulation I provided, on my system with Elasticsearch version 7.9.3 running, Only Entries B & E had the underscores removed; the Entries A & C with arrays did not. Have you any idea why I would get a result that is different than yours? – Jonathan Dec 02 '20 at 23:04
  • Can you show the document that you're talking about? – Val Dec 03 '20 at 05:05
  • @Val. I copied The script you provided in your answer, and the pipeline simulation I used in my question. I'm not sure how to show you the document. – Jonathan Dec 03 '20 at 11:11
  • Well, I'm just asking to see the sample document with Entries A & C unmodified – Val Dec 03 '20 at 11:37
  • "samples" : [ { "sample_tags" : { "Entry_D" : "last example", "Entry_C" : [ "sample2", "another_example_with_underscores" ], "Entry_B" : "A multiple underscore example", "Entry_A" : [ "A_hyphentated-sample", "sample1" ] } } ] }, "_ingest" : { "timestamp" : "2020-12-02T22:27:31.2185779Z" } ` – Jonathan Dec 03 '20 at 12:35
  • Thanks @Val, your update works. Now I'll work on figuring out how your script works, I'm new to this! – Jonathan Dec 03 '20 at 13:07
  • Awesome, that's great news! – Val Dec 03 '20 at 13:12
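
On the point raised in the comments about why changes to the Map argument show up in ctx: a minimal plain-Java sketch of the behaviour, assuming Painless follows Java here. The class and method names (SharedMapDemo, mutate, reassign) are made up for illustration. The method receives a reference to the same map object the caller holds, so put() is visible to the caller, while rebinding the parameter is not.

```java
import java.util.HashMap;
import java.util.Map;

public class SharedMapDemo {
    // The parameter refers to the caller's map object, so put() changes what the caller sees.
    static void mutate(Map<String, Object> map) {
        map.put("Entry_B", "A multiple underscore example");
    }

    // Rebinding the parameter only changes the local variable; the caller's map is untouched.
    static void reassign(Map<String, Object> map) {
        map = new HashMap<>();
        map.put("Entry_B", "ignored");
    }

    public static void main(String[] args) {
        Map<String, Object> tags = new HashMap<>();
        tags.put("Entry_B", "A_multiple_underscore_example");

        reassign(tags);
        System.out.println(tags.get("Entry_B")); // A_multiple_underscore_example

        mutate(tags);
        System.out.println(tags.get("Entry_B")); // A multiple underscore example
    }
}
```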

You were on the right path, but you were working on copies of the values: replace returns a new string, and your script never set the modified values back onto the document context ctx, which is what the pipeline eventually returns. This means you'll need to keep track of the current position as you iterate (the index for array lists, the key for hash maps, and so on for anything in between) so that you can then target each field's position in the deeply nested context.
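
To make the "copies" point concrete, here is a small plain-Java sketch (not Painless; the class and method names are hypothetical). Java strings are immutable, so replace() hands back a new string, and the list keeps the old one unless the result is stored at the same index.

```java
import java.util.ArrayList;
import java.util.List;

public class WriteBackDemo {
    // Mirrors the question's script: the result of replace() is thrown away.
    static void replaceDiscarded(List<String> values) {
        for (String value : values) {
            value.replace('_', ' '); // returns a new String; the list still holds the old one
        }
    }

    // Stores each replaced string back into its slot, tracking the index.
    static void replaceWrittenBack(List<String> values) {
        for (int i = 0; i < values.size(); i++) {
            values.set(i, values.get(i).replace('_', ' '));
        }
    }

    public static void main(String[] args) {
        List<String> tags = new ArrayList<>(List.of("A_hyphentated-sample", "sample1"));

        replaceDiscarded(tags);
        System.out.println(tags); // [A_hyphentated-sample, sample1]

        replaceWrittenBack(tags);
        System.out.println(tags); // [A hyphentated-sample, sample1]
    }
}
```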

Here's an example that takes care of strings and (string-only) array lists. You'll need to extend it to handle hash maps (and other types), and then perhaps extract the whole process into a separate function. But AFAIK a Java function cannot declare multiple return types, so it may be challenging...

PUT /_ingest/pipeline/samples
{
  "description": "preprocessing of samples.json",
  "processors": [
    {
      "script": {
        "tag": "remove underscore from sample_tags values",
        "source": """
          ArrayList samples = ctx.samples;
        
          for (int i = 0; i < samples.size(); i++) {
              def sample = samples.get(i).sample_tags;
              
              for (def entry : sample.entrySet()) {
                  def key = entry.getKey();
                  def val = entry.getValue();
                  def replaced_val = val; // values of unhandled types are written back unchanged
                  
                  if (val instanceof String) {
                    replaced_val = val.replace('_',' ');
                  } else if (val instanceof ArrayList) {
                    replaced_val = new ArrayList();
                    for (int j = 0; j < val.length; j++) {
                        replaced_val.add(val[j].replace('_',' ')); 
                    }
                  } 
                  // else if (val instanceof HashMap) {
                    // do your thing
                  // }
                  
                  // crucial part: write the new value back under sample_tags
                  ctx.samples[i].sample_tags[key] = replaced_val;
              }
          }
        """
      }
    }
  ]
}
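
One possible way to extend this to nested hash maps is a single recursive function whose return type is wide enough for every case. The sketch below is plain Java, not Painless, and the names DeepReplace and clean are hypothetical; it returns Object so that strings, lists and maps can all come back from the same function. In Painless the equivalent would presumably use def in place of Object, which is an untested assumption here.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class DeepReplace {
    // Returns a cleaned copy of the value: strings get underscores replaced,
    // lists and maps are rebuilt recursively, anything else is returned as-is.
    static Object clean(Object value) {
        if (value instanceof String) {
            return ((String) value).replace('_', ' ');
        }
        if (value instanceof List) {
            List<Object> out = new ArrayList<>();
            for (Object item : (List<?>) value) {
                out.add(clean(item));
            }
            return out;
        }
        if (value instanceof Map) {
            // Keys are copied untouched; only values are cleaned.
            Map<Object, Object> out = new LinkedHashMap<>();
            for (Map.Entry<?, ?> e : ((Map<?, ?>) value).entrySet()) {
                out.put(e.getKey(), clean(e.getValue()));
            }
            return out;
        }
        return value;
    }

    public static void main(String[] args) {
        Map<String, Object> tags = new LinkedHashMap<>();
        tags.put("Entry_A", List.of("A_hyphentated-sample", "sample1"));
        tags.put("Entry_B", "A_multiple_underscore_example");

        System.out.println(clean(tags));
        // {Entry_A=[A hyphentated-sample, sample1], Entry_B=A multiple underscore example}
    }
}
```

Because clean returns a new structure instead of editing in place, the caller still has to assign the result back, just as the script above does with replaced_val.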
Joe - GMapsBook.com