0

My ASP.Net application receives an uploaded Excel file, then converts to XML to pass as a stored procedure parameter for processing. This file contains lots of entries that are text-matched to find the corresponding value.

Today (after working perfectly for years) three seemingly identical values caused mixed results on the text match. I can only illustrate with an example:

Results

  • The first value is the original value within Excel (copied from the spreadsheet).
  • The second value is that which the web application saves to XML, and is the raw value taken from the profiler showing the parameter values
  • The third value is that which currently exists in the table

The code is pasted from SSMS:

SELECT SkillGroupTitle FROM tbl_SkillGroups WHERE SkillGroupTitle = N'Quality Assurance Instruction (QAI)'; -- pasted from Excel
SELECT SkillGroupTitle FROM tbl_SkillGroups WHERE SkillGroupTitle = N'Quality Assurance Instruction (QAI)'; -- pasted from xml
SELECT SkillGroupTitle FROM tbl_SkillGroups WHERE SkillGroupTitle = N'Quality Assurance Instruction (QAI)'; -- pasted from SQL table (existing value)

Can somebody please advise what is happening here? All values appear identical visually, but there's clearly something different within the inbound values from Excel.

Update

Pasting the two values into a hex converter, there are indeed three differences.

Excel data:

51 75 61 6c 69 74 79 a0 41 73 73 75 72 61 6e 63 65 a0 49 6e 73 74 72 75 63 74 69 6f 6e 20 28 51 41 49 29 0a
                     --                            --                                                    --

SQL data:

51 75 61 6c 69 74 79 20 41 73 73 75 72 61 6e 63 65 20 49 6e 73 74 72 75 63 74 69 6f 6e 20 28 51 41 49 29

Can anyone shed any light here please?

EvilDr
  • 8,943
  • 14
  • 73
  • 133
  • Hmm interesting. I'll update the question with the findings (although I'm still clueless what these differences are within the string) – EvilDr Mar 29 '21 at 16:40
  • 1
    a0 is a non-breaking space: https://en.wikipedia.org/wiki/Non-breaking_space. You are going to need to clean the string before you send it off to be queried. – JMabee Mar 29 '21 at 16:47
  • 1
    https://stackoverflow.com/questions/15214848/how-do-i-remove-non-breaking-spaces-from-a-column-in-sql-server – JMabee Mar 29 '21 at 16:49
  • Amazing pointers, thank you. I got it to match with `REPLACE(N'Quality Assurance Instruction (QAI)', NCHAR(0x00A0), NCHAR(0x0020))`. Please migrate your comment as an answer if you can. – EvilDr Mar 29 '21 at 18:07

1 Answers1

1

Updated

First, to identify hidden characters causing strings not to be equal you can leverage ngrams8k like this:

WITH tbl_SkillGroups(dSource, SkillGroupTitle) AS 
(
SELECT 'Excel', N'Quality Assurance Instruction (QAI)' -- pasted from Excel
UNION ALL
SELECT 'XML',   N'Quality Assurance Instruction (QAI)'+CHAR(10) -- pasted from xml
UNION ALL
SELECT 'SQL',   N'Quality Assurance Instruction (QAI)'+CHAR(13) -- pasted from SQL table (existing value)
)
SELECT 
  [Source]   = ts.dSource,
  Position   = ng.Position,
  Token      = ng.Token, 
  asciiValue = ASCII(ng.Token)
FROM        tbl_SkillGroups                     AS ts
CROSS APPLY samd.ngrams8k(ts.SkillGroupTitle,1) AS ng
--WHERE       ng.Position > 32 -- Zoom into the last few characters

Returns:

Source Position  Token   asciiValue
------ --------- ------- -----------
Excel  33        A       65
Excel  34        I       73
Excel  35        )       41  -- Only 35 characters 
XML    33        A       65
XML    34        I       73
XML    35        )       41
XML    36                10 -- 36th character is a CHAR(10) (Looks like a space)
SQL    33        A       65
SQL    34        I       73
SQL    35        )       41
SQL    36                13 -- 36th character is a CHAR(13) (Also looks like a space)

NEXT, to clean hidden characters from your inputs you can use PatReplace8k.

WITH tbl_SkillGroups(dSource, SkillGroupTitle) AS 
(
SELECT 'Excel', N'Quality Assurance Instruction (QAI)' -- pasted from Excel
UNION ALL
SELECT 'XML',   N'Quality Assurance Instruction (QAI)'+CHAR(10) -- pasted from xml
UNION ALL
SELECT 'SQL',   N'Quality Assurance Instruction (QAI)'+CHAR(13) -- pasted from SQL table (existing value)
)
SELECT SkillGroupTitle
FROM tbl_SkillGroups as ts
CROSS APPLY PatReplace8K(ts.Skillgrouptitle,'[^a-zA-Z ()]','') as pr
WHERE SkillGroupTitle = pr.NewString

Returns:

SkillGroupTitle
------------------------------------
Quality Assurance Instruction (QAI)
Quality Assurance Instruction (QAI)
Quality Assurance Instruction (QAI)

Here, PatReplace8k would remove any characters that didn't match the pattern (letters, spaces and parentheses) thus making these three values equal.

Alan Burstein
  • 7,770
  • 1
  • 15
  • 18
  • Thanks. The challenge with this is that the table accepts characters from different languages (e.g. English and Norwegian). I'm not sure if that makes the Regex pattern unfeasible...? – EvilDr Mar 30 '21 at 08:28