2

original table contains json, but i've stripped it down to the table below:

id json
1 "name":"one.it.two"
2 "name": "one.it.two"

difference between the two rows is the space after :
catalog has no stopwords.
searching for CONTAINS (json, 'it') return both rows.
searching for CONTAINS (json, 'two') return both rows.
searching for CONTAINS (json, 'one') returns only the second row.

why does searching for one not return the first row?


i've reduced the test case even further. thanks to @RobinWebb

this is no more a json or delimited text issue.

id text1
1 name:first.it
2 name: first.it

difference between the two rows is the space after :

searching for first does not return the first row.

search works if i change first.it to first.and


thanks to @AlwaysLearning, this is an issue with the word breaker

results from sys.dm_fts_parser is not consistent:

text words
name:first.it name:first.it
name
:first
it
name:first.and name
first.and
first
and
name:first,it name
first
it

i used SELECT * FROM sys.dm_fts_parser ('"<text>"', 1033, NULL, 0)

hmad
  • 159
  • 8
  • If the values are JSON, why not actually parse the data as JSON? – Thom A Jun 23 '22 at 12:42
  • parsing json to extract `name` and doing a search using `LIKE` is slow. – hmad Jun 23 '22 at 13:04
  • 1
    Because it seems that you are storing delimited data in your JSON too; so you have a doubly problematic design. Full text search isn't built for use against JSON, so I'm not surprise it isn't working as you want. If you want performant solution, then likely you shouldn't be storing JSON data with delimited data in it at all, but using a normalised approach. – Thom A Jun 23 '22 at 13:08
  • 1
    Implementation issues (above) aside...I don't think CONTAINS is working in the way that you expect it to. It isn't the same as LIKE. I have experimented with your example in a test database with FT enabled and have recreated the issue BUT I am vexed to explain precisely why your first example isn't returning a result. Simply adding spaces and removing double quotes to the values yields different outcomes. Sometimes it works and sometimes it doesn't. Very frustrating. – Robin Webb Jun 23 '22 at 14:00
  • Either way: you are using the wrong tool for the job. Either leave JSON as it is, and use computed columns for performance, or parse it out properly into a normalized design. What you have at the moment is not good for Full-Text nor for JSON. You can parse out the JSON if you fix it up, see https://dbfiddle.uk/?rdbms=sqlserver_2019&fiddle=f4469a70479b841f20e8343633001ace – Charlieface Jun 23 '22 at 14:05
  • 1
    Keeping your post on topic. FT is looking for a word, prefix of a word, a word near another word, synonym of a word, etc... https://www.mytecbits.com/microsoft/sql-server/like-vs-contains I believe that the reason that "name":"one.it.two" isn't being found is because this is being interpreted/indexed as "name:one.it.two". Putting a space after the colon makes "one" appear as a word"name: one.it.two". Perhaps this might help? I think if you pursue this avenue of investigation you will make sense of this. – Robin Webb Jun 23 '22 at 14:32
  • 1
    thanks for this @RobinWebb i've modified the question. – hmad Jun 23 '22 at 15:35
  • 1
    this is not a json or delimited text issue anymore. the search fails even with plain text. see update to the question. pretty sure a fix for this will resolve the original issue with json with delimited text. – hmad Jun 23 '22 at 17:31
  • 1
    It really depends on your full-text catalog and wordbreaker, but compare the output of `SELECT * FROM sys.dm_fts_parser (' "name:first.it" ', 1033, 0, 0);` and `SELECT * FROM sys.dm_fts_parser (' "name: first.it" ', 1033, 0, 0);`... likely the first one returns `:first` as a word, which is why it doesn't match `first` as a search term. – AlwaysLearning Jun 25 '22 at 11:42
  • @AlwaysLearning updated the question with this finding. is wordbreaker customizable? its parsing `:first` as a word, any way around this? – hmad Jun 27 '22 at 14:13

1 Answers1

0

Based on the info provided in this answer https://dba.stackexchange.com/a/65845/94130 it seems that .it is treated as a special word (possibly a top level domain) by word breaker.

I can only infer that this "special word logic" has a b̶u̶g̶ feature in it, where : is treated as part of the name. Examples:

SELECT * FROM sys.dm_fts_parser (' "name:first.it" ', 1033, 0, 0);
SELECT * FROM sys.dm_fts_parser (' "name:first.net" ', 1033, 0, 0);
SELECT * FROM sys.dm_fts_parser (' "name:first.com" ', 1033, 0, 0);
SELECT * FROM sys.dm_fts_parser (' "name:first.gov" ', 1033, 0, 0);

Notice that it always returns one extra result. I assume it includes the extra line when it thinks that part of the string is a URL.

special_term display_term expansion_type source_term
Exact Match name:first.gov 0 name:first.gov
Exact Match name 0 name:first.gov
Exact Match :first 0 name:first.gov
Exact Match gov 0 name:first.gov

Note that some words are not affected:

SELECT * FROM sys.dm_fts_parser (' "name:first.he" ', 1033, 0, 0);

I have modified code provided in https://dba.stackexchange.com/a/25848/94130 to get all characters that are treated this way.

declare @i integer
declare @cnt integer
set @i=0
while @i<255
begin
  set @cnt=0
  select @cnt=COUNT(1) FROM sys.dm_fts_parser ('"name'+REPLACE(CHAR(@i),'"','""')+'first.net"', 1033, 0, 0)
  WHERE display_term = CHAR(@i) + 'first'
  if @cnt=1
    begin
      print 'this char - '+CASE WHEN @i > 31 THEN char(@i) ELSE '' END+' - char('+convert(varchar(3),@i)+') is included'
    end
  set @i=@i+1
end 

Output:

this char -  - char(0) is included
this char - : - char(58) is included
this char - ­ - char(173) is included

There is a Microsoft article explaining how to switch word breakers/stemmers, which may (or may not) solve this but I have not tried this.

Note: Above code was executed on Win 11 and SQL 2019 Dev

Alex
  • 4,885
  • 3
  • 19
  • 39