-2

I have a SQL Server table with a column of type TEXT that would store candidate resumes in different format. RTF is the most common one but often we get resume data from a 3rd party converter which stores the resume as special characters (maybe Unicode or I don't know what they are).

  1. How do I search my table to find all the rows that have these special characters? For example the rows with id = 4,6,7, 9 etc. all are the records with special characters.

  2. What format are these special characters called? Unicode??

enter image description here

marc_s
  • 732,580
  • 175
  • 1,330
  • 1,459
Naupad Doshi
  • 496
  • 2
  • 5
  • 19
  • 2
    Depending on what you are defining as "special" characters, you might find [this SO accepted answer](https://stackoverflow.com/questions/3891331/find-non-ascii-characters-in-varchar-columns-using-sql-server) helpful. – Tim Biegeleisen Jan 13 '22 at 05:27
  • 3
    `ntext` and `text` data types will be removed in a future version of SQL Server. Avoid using these data types in new development work, and plan to modify applications that currently use them. Use `nvarchar(max)` or `varchar(max)` instead. [See details here](http://msdn.microsoft.com/en-us/library/ms187993.aspx) – marc_s Jan 13 '22 at 05:45
  • Do the special characters ever appear inside normal RTF documents? If not, wouldn't it be easier to search for records whose RESUME_TEXT doesn't start with `{\rtf1`? – AlwaysLearning Jan 13 '22 at 06:08
  • _I don't know_ Go find out. Your "solution" is fundamentally flawed if "anything" can be stored here and you have a requirement to "search it". Then think long and hard about what it means to search for "special characters" and what your actual goal is. Why do you need to search for these **at all**? When you find them, what do you intend to do? This sounds like a big [XY problem](https://meta.stackexchange.com/questions/66377/what-is-the-xy-problem) – SMor Jan 13 '22 at 12:26
  • 1
    First try seeing what these characters are, you might do: ` select cast(left(cast(resume_text as varchar(max)),20) as varbinary(max))` for a sample row. If you use SSMS this should show up as a string of hexadecimal characters (each pair following 0x is a byte; one or more of these would represent a character, what character they represent depend on the encoding; we can try finding a pattern there using ASCII/Unicode tables. But they may have just uploaded a PDF document, for example; then the bytes are not encoding characters. – tinazmu Jan 13 '22 at 21:28

1 Answers1

1

Assuming that by "special" characters you mean anything outside the set of printable ASCII and certain common whitespace characters , you can try the following:

DECLARE @SpecialPattern VARCHAR(100) =
    '%[^'
    + CHAR(9) + CHAR(10) + CHAR(13) -- tab, CR, LF
    + CHAR(32) + '-' + CHAR(126) -- Range from space to last printable ASCII
    + ']%'

SELECT
    RESUME_TEXT,
    cast(left(cast(resume_text as varchar(max)),20) as varbinary(max))` -- Borrowed from userMT's comment
FROM RESUME
WHERE RESUME_TEXT LIKE @SpecialPattern COLLATE Latin1_General_Bin -- Use exact compare

You may get some false hits against some perfectly valid extended characters such as accented vowels, curly quotes, or m- and n- dashes that may exist in the text.

My first though is that the weird characters might be a UTF-8 BOM (hex EF, BB, BF), but the display didn't seem to match the how I would expect SQL Server to render them. The inverse dot isn't present at all in the default windows code page (1252).

We need at least some hex data (at least the first few bytes) to help further. Often, common binary file types have a recognizable signature in the first 3-5 bytes.

T N
  • 4,322
  • 1
  • 5
  • 18
  • 1
    Errata: I added COLLATE Latin1_General_Bin to the above to force an exact code sequence compare – T N Jan 14 '22 at 01:32