2

I have a column with HTML content. I want to search for words in that column, but only the text, not the HTML code.

For example:

(1) <p class="last">First time I went there...</p>
(2) This is a <em>very</em> subtle colour.

(1) Searching for last doesn't find it, because it's a class name, not content.
(2) Searching for very subtle will find it, ignoring HTML

Is this possible with SQLite directly?

Note: I cannot define functions.

MPelletier
  • 16,256
  • 15
  • 86
  • 137
Francisc
  • 77,430
  • 63
  • 180
  • 276
  • 5
    I would recommend you add a column to your table that contains a "plain text" version of your HTML content. You would have to populate this column yourself, but once populated you can search across it. – Lily Ballard Sep 07 '11 at 23:28
  • 2
    A reliable solution using regex only would be extremely hard to code. – NullUserException Sep 07 '11 at 23:32
  • Kevin: That would double the database size. It's already a bit too large for its own good. – Francisc Sep 07 '11 at 23:38
  • 2
    @Francisc sounds like you're pushing the limits of SQLite. If you can't afford to add a new column, perhaps consider changing your RDBMS. It's far too [painful to painful to parse or regex through HTML](http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454). – p.campbell Sep 07 '11 at 23:43

2 Answers2

1

Don't do it with SQLite.

Do it with your programming language, your framework that is using SQLite.

In the table, where you have the column with the html code, add additional columns for data about the html. You will have to gather the data for the extra columns, while you analyze the html with your framework.

Track data about the structure the html format does have and save in an extra column the textual content of the html data.

You can get all tags by simple REGEX:

/<?[^<>]+>?/

Checkout how you receive data by scanning the html data for tags with the regexp above and write an iterated evaluation for tag-content (i.e. if a string in the results-array starts with a "<" it´s a tag, by scanning it with /<\s*\/\s*[^>]+>/ you will see if it is a ending tag and by scanning it with /<\s*[^\/>]+\s*\/\s*>/ you will see if it is a single closed tag. If none of the differentiated states does apply, it is textual content.

JAkk
  • 1,326
  • 2
  • 10
  • 12
0

There isn't a good way to do that in SQLite directly (you'd need to build a SQLite extension that would parse the HTML and let you search through it like MSSQL's XML field type).

Your best bet is going to be to parse the HTML in your code and write out all the text into a separate column to search on as @Kevin suggests in the comments.

E.g.

 ID | HTML                                   | Text
 ---------------------------------------------------------------------------
 1  | <p class="last">First time ...</p>     | First time ...
 2  | This is a <em>very</em> subtle colour. | This is a very subtle colour.
Sean Vieira
  • 155,703
  • 32
  • 311
  • 293