I have started experimenting with Elasticsearch ingest pipelines and processors as a possibly faster way to build what I can best describe as an "inverted index" of my documents.
Here's what I'm trying to do: I have a documents index. Each document is akin to the following:
{
  "id": "DOC1",
  "title": "Quiz no. 1",
  "questions": [
    {
      "question": "Who was the first person to walk on the Moon?",
      "choices": [
        { "answer": "Michael Jackson", "correct": false },
        { "answer": "Neil Armstrong", "correct": true }
      ]
    },
    {
      "question": "Who wrote Macbeth?",
      "choices": [
        { "answer": "William Shakespeare", "correct": true },
        { "answer": "Dante Alighieri", "correct": false },
        { "answer": "Arthur Conan Doyle", "correct": false }
      ]
    }
  ]
}
I am trying to understand if there is a magic combination of reindex, pipelines, and processors that would allow me to automatically build a questions index. Here's an example of what that index would look like:
[
  {
    "question_id": "<randomly-generated-value-1>",
    "document_id": "DOC1",
    "question": "Who was the first person to walk on the Moon?",
    "choices": [
      { "answer": "Michael Jackson", "correct": false },
      { "answer": "Neil Armstrong", "correct": true }
    ]
  },
  {
    "question_id": "<randomly-generated-value-2>",
    "document_id": "DOC1",
    "question": "Who wrote Macbeth?",
    "choices": [
      { "answer": "William Shakespeare", "correct": true },
      { "answer": "Dante Alighieri", "correct": false },
      { "answer": "Arthur Conan Doyle", "correct": false }
    ]
  }
]
The Elasticsearch documentation mentions that you can run a reindex through a specific ingest pipeline. Following the simulate pipeline docs, I've been trying a few processors, including foreach, but I can't work out whether the documents coming out of the pipeline are always 1:1 with the source index, or whether one source document can generate multiple destination documents (which is what I need).
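For reference, the reindex call I have in mind looks roughly like this (a minimal sketch: the destination index questions and the pipeline id questions-pipeline are just placeholder names on my side):

POST _reindex
{
  "source": {
    "index": "documents"
  },
  "dest": {
    "index": "questions",
    "pipeline": "questions-pipeline"
  }
}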
Here's the pipeline I'm simulating:
{
  "pipeline": {
    "description": "Inverts the documents index into a questions index",
    "processors": [
      {
        "rename": {
          "field": "id",
          "target_field": "document_id",
          "ignore_missing": false
        }
      },
      {
        "foreach": {
          "field": "questions",
          "processor": {
            "rename": {
              "field": "_ingest._value.question",
              "target_field": "question"
            }
          }
        }
      },
      {
        "foreach": {
          "field": "questions",
          "processor": {
            "rename": {
              "field": "_ingest._value.choices",
              "target_field": "choices"
            }
          }
        }
      },
      {
        "remove": {
          "field": "questions"
        }
      }
    ]
  }
}
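For completeness, this is roughly how I feed the sample document to the simulation. A minimal sketch, assuming the pipeline above were stored under the (made-up) id questions-pipeline; when passing the pipeline inline, as above, the pipeline object simply sits next to the docs array in the _ingest/pipeline/_simulate request body:

POST _ingest/pipeline/questions-pipeline/_simulate
{
  "docs": [
    {
      "_index": "documents",
      "_id": "DOC1",
      "_source": {
        "id": "DOC1",
        "title": "Quiz no. 1",
        "questions": [
          {
            "question": "Who was the first person to walk on the Moon?",
            "choices": [
              { "answer": "Michael Jackson", "correct": false },
              { "answer": "Neil Armstrong", "correct": true }
            ]
          },
          {
            "question": "Who wrote Macbeth?",
            "choices": [
              { "answer": "William Shakespeare", "correct": true },
              { "answer": "Dante Alighieri", "correct": false },
              { "answer": "Arthur Conan Doyle", "correct": false }
            ]
          }
        ]
      }
    }
  ]
}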
This is almost working. The problem with this approach is that there is only one resulting document, and it corresponds to the first question. The second question is not present in the output of the simulated pipeline, hence my doubt: can a pipeline of processors output multiple destination documents from a single source document, or are we forced to keep a 1:1 relationship?