You could use an ingest pipeline with a script processor to extract the link text:
1. Set up the pipeline
PUT _ingest/pipeline/clean_links
{
  "description": "...",
  "processors": [
    {
      "script": {
        "source": """
          if (ctx["content"] == null) {
            // nothing to do here
            return
          }
          def content = ctx["content"];
          Pattern pattern = /\[([^\]\[]+)\](\(((?:[^\()]+)+)\))/;
          Matcher matcher = pattern.matcher(content);
          def purged_content = matcher.replaceAll("$1");
          ctx["purged_content"] = purged_content;
        """
      }
    }
  ]
}
The regex can be tested here and is inspired by this.
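Before indexing anything, the pipeline can also be dry-run against sample documents with the simulate pipeline API. The request below is a minimal sketch that assumes the `clean_links` pipeline from step 1 already exists; the sample documents are the same ones used in step 2.

```
POST _ingest/pipeline/clean_links/_simulate
{
  "docs": [
    {
      "_source": {
        "content": "[Mylink](https://link-url-here.org) [anotherLink](http://dot.com)"
      }
    },
    {
      "_source": {
        "content": "[Mylink2](another_page.md)"
      }
    }
  ]
}
```

Each entry in the response should carry the transformed `_source`, so you can check the `purged_content` value without writing to any index.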
2. Include the pipeline when ingesting the docs
POST my-index/_doc?pipeline=clean_links
{
  "content": "[Mylink](https://link-url-here.org) [anotherLink](http://dot.com)"
}

POST my-index/_doc?pipeline=clean_links
{
  "content": "[Mylink2](another_page.md)"
}
The python docs are here.
3. Verify
GET my-index/_search?filter_path=hits.hits._source
should yield
{
  "hits" : {
    "hits" : [
      {
        "_source" : {
          "purged_content" : "Mylink anotherLink",
          "content" : "[Mylink](https://link-url-here.org) [anotherLink](http://dot.com)"
        }
      },
      {
        "_source" : {
          "purged_content" : "Mylink2",
          "content" : "[Mylink2](another_page.md)"
        }
      }
    ]
  }
}
You could instead overwrite the original content field if you want to discard the links from your _source entirely.
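A variant that overwrites the field in place could look like the sketch below. The pipeline name `replace_links` is hypothetical, and the regex is the same one used in step 1; only the assignment target changes.

```
PUT _ingest/pipeline/replace_links
{
  "description": "Hypothetical variant: overwrite content with the link-free text",
  "processors": [
    {
      "script": {
        "source": """
          if (ctx["content"] == null) {
            // nothing to do here
            return;
          }
          Matcher matcher = /\[([^\]\[]+)\](\(((?:[^\()]+)+)\))/.matcher(ctx["content"]);
          // keep only the link text (capture group 1) and write it back to content
          ctx["content"] = matcher.replaceAll("$1");
        """
      }
    }
  ]
}
```

Keep in mind that once the original markdown is overwritten, the URLs cannot be recovered from the indexed document.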
In contrast, you could go a step further in the other direction and store the text + link pairs in a nested field of the form:
{
  "content": "...",
  "links": [
    {
      "text": "Mylink",
      "href": "https://link-url-here.org"
    },
    ...
  ]
}
so that when you later decide to make them searchable, you'll be able to do so with precision.
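One way to populate such a field is sketched below. The index name `my-index-links`, the pipeline name `extract_links`, and the choice of `text` and `keyword` sub-field types are all assumptions for illustration. The script reuses the regex from step 1, where capture group 1 holds the link text and capture group 3 holds the URL without the surrounding parentheses.

```
PUT my-index-links
{
  "mappings": {
    "properties": {
      "content": { "type": "text" },
      "links": {
        "type": "nested",
        "properties": {
          "text": { "type": "text" },
          "href": { "type": "keyword" }
        }
      }
    }
  }
}

PUT _ingest/pipeline/extract_links
{
  "description": "Hypothetical variant: collect text + href pairs into links",
  "processors": [
    {
      "script": {
        "source": """
          if (ctx["content"] == null) {
            return;
          }
          def links = [];
          Matcher matcher = /\[([^\]\[]+)\](\(((?:[^\()]+)+)\))/.matcher(ctx["content"]);
          // one entry per markdown link found in content
          while (matcher.find()) {
            links.add(["text": matcher.group(1), "href": matcher.group(3)]);
          }
          ctx["links"] = links;
        """
      }
    }
  ]
}
```

Documents would then be indexed with `?pipeline=extract_links`, in the same way as in step 2.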
Shameless plug: you can find other hands-on ingestion guides in my Elasticsearch Handbook.