Update. (+18d) edited title and provided answer addressing original question.
tl;dr
I am indexing HTML pages and storing the <p>...</p> content as the snippet shown in search results. However, I don't want or need all of that content, just the context around the text that matched the query.
Background
With these in my [classic] schema,
<fieldType name="text_general" class="solr.TextField" positionIncrementGap="100"
autoGeneratePhraseQueries="true" multiValued="true">
<field name="p" type="text_general" indexed="true" stored="true" multiValued="true"
omitNorms="true" termVectors="true" />
and these in my solrconfig.xml
<str name="queryAnalyzerFieldType">text_general</str>
<updateProcessor class="solr.AddSchemaFieldsUpdateProcessorFactory" name="add-schema-fields">
<lst name="typeMapping">
<str name="valueClass">java.lang.String</str>
<str name="fieldType">text_general</str>
<lst name="copyField">
<str name="dest">*_str</str>
<int name="maxChars">256</int>
</lst>
...
<initParams path="/update/**,/query,/select,/spell">
<lst name="defaults">
<str name="df">_text_</str>
</lst>
</initParams>
<requestHandler name="/update/extract"
class="org.apache.solr.handler.extraction.ExtractingRequestHandler">
<lst name="defaults">
<str name="lowernames">true</str>
<str name="uprefix">ignored_</str>
<str name="capture">div</str>
<str name="fmap.div">div</str>
<str name="capture">p</str>
<str name="fmap.p">p</str>
<str name="processor">uuid,remove-blank,field-name-mutating,parse-boolean,
parse-long,parse-double,parse-date</str>
</lst>
</requestHandler>
<requestHandler name="/query" class="solr.SearchHandler">
<lst name="defaults">
<str name="echoParams">explicit</str>
<str name="wt">json</str>
<str name="indent">true</str>
</lst>
</requestHandler>
<queryResponseWriter name="json" class="solr.JSONResponseWriter">
<!-- For the purposes of the tutorial, JSON responses are written as
plain text so that they are easy to read in *any* browser.
If you expect a MIME type of "application/json" just remove this override.
-->
<str name="content-type">text/plain; charset=UTF-8</str>
</queryResponseWriter>
I get this result [Solr Admin UI; facsimile shown here],
"p":["Sentence 1. Sentence 2. Sentence 3. Sentence 4. ..."]
In the source HTML document those sentences each occur in their own p-tag, e.g. <p>Sentence 1.</p>, <p>Sentence 2.</p>, ...
Questions
How can I index them individually? My rationale is that I want to display a snippet of the context around the matched text in a search result, not the entire p-tagged content.
Additionally, with the Linux grep command we can, e.g., return a line before and after the matched line (the -C1 context argument). Can we do something similar here? I.e., if the Solr query match is in Sentence 2, the snippet would contain Sentences 1-3?
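To make the two questions concrete, here is a minimal Python sketch of the desired behavior, run outside Solr on the raw HTML. It is not a Solr feature or API: `ParagraphCollector`, `extract_paragraphs`, and `context_snippets` are hypothetical names used only for illustration, and the matching is a plain case-insensitive substring test rather than Solr's analysis chain.

```python
from html.parser import HTMLParser


class ParagraphCollector(HTMLParser):
    """Collect the text of each <p> element as a separate list entry."""

    def __init__(self):
        super().__init__()
        self.paragraphs = []
        self._buffer = None  # None while we are outside a <p> element

    def _flush(self):
        # Store the paragraph collected so far and leave "inside <p>" state.
        if self._buffer is not None:
            self.paragraphs.append("".join(self._buffer).strip())
            self._buffer = None

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            self._flush()  # an unclosed <p> is implicitly ended by the next one
            self._buffer = []

    def handle_data(self, data):
        if self._buffer is not None:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "p":
            self._flush()


def extract_paragraphs(html):
    """Return the text of every <p> element, one list entry per element."""
    parser = ParagraphCollector()
    parser.feed(html)
    parser.close()
    parser._flush()  # keep a trailing <p> that was never closed
    return parser.paragraphs


def context_snippets(paragraphs, term, context=1):
    """For each paragraph containing `term`, return it together with
    `context` neighbours on each side, similar to `grep -C`."""
    snippets = []
    for i, text in enumerate(paragraphs):
        if term.lower() in text.lower():
            start = max(0, i - context)
            end = min(len(paragraphs), i + context + 1)
            snippets.append(paragraphs[start:end])
    return snippets


html = "<p>Sentence 1.</p><p>Sentence 2.</p><p>Sentence 3.</p><p>Sentence 4.</p>"
paragraphs = extract_paragraphs(html)
print(paragraphs)
print(context_snippets(paragraphs, "Sentence 2"))
```

With the sample input, the first call yields one entry per p-tag, and the second returns Sentences 1-3 for a match in Sentence 2, which is the `-C1` behavior described above.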
I tried assigning unique ids to the p-elements (<p id="a">...</p> <p id="b">...</p>), but I just got this in Solr:
"p":["a Sentence 1. b Sentence 2. Sentence d 3. Sentence d 4. ..."]