I am following the post "Creating an index Nest" and trying to update my index settings. Everything runs fine; however, the html_strip filter is not stripping HTML. My code is:

var node = new Uri(_url + ":" + _port);
var settings = new ConnectionSettings(node);
settings.SetDefaultIndex(index);
_client = new ElasticClient(settings);

// To apply filters during indexing: use folding to remove diacritics and html_strip to remove HTML
_client.UpdateSettings(
    f => f.Analysis(descriptor => descriptor
        .Analyzers(bases => bases
            .Add("folded_word", new CustomAnalyzer
            {
                Filter = new List<string> { "icu_folding", "trim" },
                Tokenizer = "standard"
            }))
        .CharFilters(cf => cf
            .Add("html_strip", new HtmlStripCharFilter()))));
Ismail

1 Answer

You are getting this error:

Can't update non dynamic settings[[index.analysis.analyzer.folded_word.filter.0, index.analysis.char_filter.html_strip.type, index.analysis.analyzer.folded_word.filter.1, index.analysis.analyzer.folded_word.type, index.analysis.analyzer.folded_word.tokenizer]] for open indices[[my_index]]

As the error says, analysis settings are non-dynamic and cannot be changed on an open index. Close the index first, update the settings, and reopen it afterwards:

client.CloseIndex(..);

client.UpdateSettings(..);

client.OpenIndex(..);

UPDATE

Add the html_strip char filter to your custom analyzer:

.Analysis(descriptor => descriptor
                    .Analyzers(bases => bases.Add("folded_word",
                        new CustomAnalyzer
                        {
                            Filter = new List<string> { "icu_folding", "trim" }, 
                            Tokenizer = "standard", 
                            CharFilter = new List<string> { "html_strip" }
                        }))
                )

Now you can run a test to check whether this analyzer returns the correct tokens:

client.Analyze(a => a.Index(indexName).Text("this <a> is a test <div>").Analyzer("folded_word"));

Output:

this
is
a
test

Hope it helps.

Frederik Struck-Schøning
Rob
  • Rob, many thanks, your suggestion worked and I can see the filter. However, during indexing the HTML is not being stripped. – Ismail Jun 19 '15 at 13:53
  • @Ismail could you share the index mapping? – Rob Jun 19 '15 at 13:54
  • `{ umbracotest: { settings: { index: { uuid: "eb3hMpFrS8qyb3DxHZ4_eg", analysis: { char_filter: { html_strip: { type: "html_strip" } } }, number_of_replicas: "1", number_of_shards: "5", version: { created: "1020099" } } } } }` – Ismail Jun 19 '15 at 13:57
  • Rob, your update and testing with Analyze work, many thanks. However, when I index, the HTML is still there. When calling Index, do you have to pass in which analyzer to use? I am assuming it infers that from what is set during client init? – Ismail Jun 22 '15 at 11:39
  • @Ismail I think I understand your concern now. Your content with HTML tags has been indexed using the folded_word analyzer, but what you are getting back is the original content, not the indexed tokens. Hope that's clear enough. [Here](https://www.elastic.co/guide/en/elasticsearch/guide/current/analysis-intro.html) you can find more info on how Elasticsearch works under the hood. – Rob Jun 22 '15 at 11:53
  • [This](http://stackoverflow.com/questions/15299799/elasticsearch-impact-of-setting-a-not-analyzed-field-as-storeyes/15320692#15320692) one should be quite useful too. – Rob Jun 22 '15 at 12:02
  • Rob, ah, that makes sense. I was querying with Sense and still seeing the HTML. Once again, many thanks for your help. – Ismail Jun 22 '15 at 12:13
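
To illustrate the point reached in the comments (analysis changes the tokens that get indexed, not the document you get back), here is a toy C# sketch. It does not call Elasticsearch or NEST at all: StripHtml and Tokenize are hypothetical stand-ins for the html_strip char filter and the standard tokenizer, and the regular expression is only a crude approximation, not what Elasticsearch does internally.

```csharp
using System;
using System.Text.RegularExpressions;

public static class AnalysisVersusSource
{
    // Hypothetical stand-in for the html_strip char filter:
    // replaces anything that looks like a tag with a space.
    public static string StripHtml(string text)
    {
        return Regex.Replace(text, "<[^>]*>", " ");
    }

    // Hypothetical stand-in for the standard tokenizer:
    // splits on spaces and drops empty entries.
    public static string[] Tokenize(string text)
    {
        return text.Split(new[] { ' ' }, StringSplitOptions.RemoveEmptyEntries);
    }

    public static void Main()
    {
        string source = "this <a> is a test <div>";

        // What ends up in the index: tokens produced after char filtering.
        string[] tokens = Tokenize(StripHtml(source));
        Console.WriteLine(string.Join(",", tokens)); // this,is,a,test

        // What a search hit returns: the original document, tags included.
        Console.WriteLine(source); // this <a> is a test <div>
    }
}
```

The first line printed matches the Analyze output in the answer, while the second shows why the HTML is still visible when you look at search results: the stored source is never rewritten by the analyzer.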