I have a database table with a column question_description. I have indexed the column in Sphinx and am getting results successfully. The problem is that the column contains encoded HTML along with text, and I want Sphinx to search only the text, ignoring the encoded HTML. How can I configure this? Thanks!

1 Answer
I'm not totally sure what you mean by 'encoded' HTML. Does that mean gzipped or something?
But do see html_strip: http://sphinxsearch.com/docs/current.html#conf-html-strip
"HTML tags are removed, their contents (i.e., everything between <P> and </P>) are left intact by default."
Edited to add (too long for a comment!):
Eek, yes, OK, you do have 'encoded' HTML (in this case using HTML entities). Sphinx DOESN'T have explicit support for decoding that.
It can 'strip' plain HTML, not encoded HTML. Frankly, encoding it like that seems to add lots of extra overhead (your real HTML entities will then be DOUBLE encoded) and means you always need to decode it when 'using' the HTML (be that outputting it in a webpage, or sending it to Sphinx, etc.).
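To make the point concrete, here is a rough Python sketch of why a tag stripper cannot help with entity-encoded text. The function `strip_tags_naive` is a hypothetical stand-in written for this illustration; it is not how Sphinx implements html_strip, just a simple regex that removes anything shaped like a tag.

```python
import re

def strip_tags_naive(text: str) -> str:
    """Remove anything that looks like <...>; a rough stand-in for html_strip."""
    return re.sub(r"<[^>]+>", "", text)

plain = "<p>Automated Welding Services, Inc.</p>"
encoded = "&lt;p&gt;Automated Welding Services, Inc.&lt;/p&gt;"

print(strip_tags_naive(plain))    # the <p> and </p> tags are removed
print(strip_tags_naive(encoded))  # unchanged: there is no literal < or > to match
```

In the encoded version the "tags" are just ordinary text made of `&lt;` and `&gt;`, so they survive stripping and would presumably end up indexed as words.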
Handling this would need an XMLPipe2 index (or other pipe index) to decode the text for indexing (which will be quite complicated, as you will have to decode the htmlspecialchars output and then re-encode it as XML),
or maybe a MySQL function to decode it during the sql_query (see: is there a mysql function to decode html entities?).
Second edit to add:
Actually, checking http://php.net/manual/en/function.htmlspecialchars.php, it seems htmlspecialchars only really does 5 transformations.
That might be practical to fix with regexp_filter: you could replace the entities back with their unencoded versions.
Regexp filters are applied BEFORE HTML processing: http://sphinxsearch.com/blog/2014/11/26/sphinx-text-processing-pipeline/
http://sphinxsearch.com/docs/current.html#conf-regexp-filter
regexp_filter = &quot; => "
... etc
regexp_filter = &amp; => &
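The order of those replacements matters, which is why the `&amp;` rule comes last. A minimal Python sketch of the same idea follows. It only simulates the substitutions and does not talk to Sphinx; the name `decode_specialchars` and the sample strings are made up for illustration, and the single-quote entity is assumed to be `&#039;`.

```python
# (entity, replacement) pairs for the five htmlspecialchars transformations.
# "&amp;" is deliberately last so that already-decoded text is not decoded twice.
ENTITIES = [
    ("&quot;", '"'),
    ("&#039;", "'"),
    ("&lt;", "<"),
    ("&gt;", ">"),
    ("&amp;", "&"),
]

def decode_specialchars(text: str) -> str:
    """Apply the replacements in order, one pass per entity."""
    for entity, char in ENTITIES:
        text = text.replace(entity, char)
    return text

# A stored value whose original HTML contained a real "&amp;" entity,
# so it is double encoded in the database.
stored = "&lt;p&gt;Tom &amp;amp; Jerry&lt;/p&gt;"
print(decode_specialchars(stored))  # plain HTML again, with a single "&amp;"
```

If `&amp;` were replaced first, a stored `&amp;lt;` (a literal "&lt;" in the original HTML) would become `&lt;` and then `<`, which is one decode too many. After decoding, the text is ordinary HTML that a tag stripper can deal with.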

I am storing the output from a rich text editor in the database. The output contains HTML tags as well, so to save it in the table I am using htmlspecialchars() in PHP. An encoded example is as follows: "&lt;p&gt;Automated Welding Services, Inc. (AWS) bu...... bla bla bla". Now I need to skip the "&lt;p&gt;" and search only in "Automated Welding Services, Inc.". I hope I made my point clear. – Bilal_Ahmad Dec 08 '17 at 13:47