5

I need to use a full text search to get the rows where the column contains the word O'henry

My code:

$word = "O'henry";
$sql = "SELECT * FROM mytable WHERE to_tsvector('english', col) @@ to_tsquery(:word) ";

$sth = $db->prepare( $sql );
$sth->execute(
    array(":word"=>$word)
);

I have two problems:

1) This query matches columns which contains the word O'henry, but it also matches columns containing, for example: "O yes, thierry henry is good sportsman."

2) If $word begins with quotes, for example 'henry, I got an error: syntax error in tsquery: "'henry", in spite of the fact that the search string was already escaped.

How can I solve this problem?

Expedito
  • 7,771
  • 5
  • 30
  • 43
Oto Shavadze
  • 40,603
  • 55
  • 152
  • 236

2 Answers2

4

The default english dictionary is going to tokenize your data treating the ' as a it would a space. You can use ts_debug to inspect how PostgreSQL/tsearch will treat your text:

psql=# SELECT * FROM ts_debug('english','o''henry');
   alias   |   description   | token |  dictionaries  |  dictionary  | lexemes 
-----------+-----------------+-------+----------------+--------------+---------
 asciiword | Word, all ASCII | o     | {english_stem} | english_stem | {o}
 blank     | Space symbols   | '     | {}             |              | 
 asciiword | Word, all ASCII | henry | {english_stem} | english_stem | {henri}
(3 rows)

It's clear from this output that if you want to have postgres treat o'henry as a single word, you are going to have to either:

  • create a custom dictionary to handle words with apostrophes
  • or strip apostrophes from both your tsvector and tsquery before use.

I'd say the second option is by far the easiest:

$sql = "SELECT * FROM mytable WHERE plainto_tsvector('english', replace(col, '''','')) @@ to_tsquery(replace(:word,'''',''));"
Chris Farmiloe
  • 13,935
  • 5
  • 48
  • 57
  • Do you know how to create a custom dictionary for it? I already have a custom one for removing accents so it'd be nice to bake this into it as well. – rutchkiwi May 14 '21 at 16:46
3

After reading through the documentation, I believe that this is a limitation within the types of strings accepted by to_tsquery. According to this page:

the input to to_tsquery must already follow the general rules for tsquery input

The tsquery input rules are defined here. Based on the examples given, the engine processes each input word into single-quoted strings, which will break on your input. As a workaround, I found a suggestion in this forum post (where the user was trying to properly escape the input 'ksan:

select *
from items
where to_tsvector(name) @@ to_tsquery(E'[\']ksan')

To apply this to your situation, you would need to pre-format your input using something like this:

$word = 'E' . str_replace("'", "[\']", $word);
George Cummins
  • 28,485
  • 8
  • 71
  • 90
  • Thanks very much, this solves 2 problem (in my question), but 1) problem still stays – Oto Shavadze May 13 '13 at 18:47
  • Have you verified this by testing? As I understand it, the first problem is caused because your search term is being split into two strings ('O' and 'henry'), and this new method should prevent that. – George Cummins May 13 '13 at 18:48
  • Yes, tested, this query: `... WHERE to_tsvector('english', col) @@ to_tsquery(E'O[\']henry')` matches also that columns, which contains text: `O some words here henry` – Oto Shavadze May 13 '13 at 19:04
  • Hm, I see. I will continue looking for a solution, but based on open question such as [this one](http://stackoverflow.com/questions/4997943/postgresql-fulltext-search-for-words-with-apostrophes), I am not overly hopeful. – George Cummins May 13 '13 at 19:09