6

My db background is on MS SQL Server side where text comparison in indexes and constraints is not case-sensitive (at least by default). So once you have a value "abc" assigned to a unique column, you can not store a second value "ABC" and if you search for "ABC" SQL Server will find "abc".

With Oracle things are different, so even with unique index on a text column you can store there both "abc" and "ABC", and if you search for "AbC" you won't get any result.

AFAIK prior to Oracle 10gR2 there was no way around it, now it's possible to set insensitive comparison per sesson which IMHO is not a good solution because everything depends on programmers' discipline.

But what's worst with case-sensitive lookup is that those who rewrite all searches as UPPER(some_column)=UPPER(some_text) (and this is what many discussion threads recommend) end with table scan even when there is an index on some_column. Performance implication is disastereous: I just tested a simple search on a table with half a million rows, and search with UPPER function call took 20 times longer than the search with just a column identifier, thus confirming that the index is not used when doing function-based search.

Is it really so that the most standard technique to do case-insensitive search in Oracle database is to apply UPPER/LOWER functions to search elements despite of bad performance? Or are there more elegant ways of addressing this issue?

cwallenpoole
  • 79,954
  • 26
  • 128
  • 166
Vagif Abilov
  • 9,835
  • 8
  • 55
  • 100
  • possible duplicate of [Make Oracle SQL LIKE query case insensitive](http://stackoverflow.com/questions/5391069/make-oracle-sql-like-query-case-insensitive) – Ben Feb 09 '13 at 11:32
  • Tbone, you're going around in circles, the guy says it is difficult to enforce "cleansing" standards, and I imagine it's a bit late with gig/tera bytes of data. You're not really giving a practical solution to the problem (maybe for a brand new system). I agree with him that case insensitive search is a very basic bit of functionality that isn't handled well. Setting `nls_comp` and `nls_sor` is one method, along with an associated index using same `nls_sort` value. However, prior to 11g, it would only work with equality operator and not with `LIKE`. Oracle text index is another method. It does – ojock Aug 09 '13 at 05:58

2 Answers2

7

Yes, use of of UPPER(some_column)=UPPER(some_text) really is the best way, but you can create an index on UPPER(some_column). That should alleviate the problem.

cwallenpoole
  • 79,954
  • 26
  • 128
  • 166
  • 1
    In case I create all indexes on UPPER(column) will it enforce: 1. unique constraint violation if I try to create antother column value with different casing? 2. I can use in search some_column=UPPER(some_text) instead of UPPER(some_column)=UPPER(some_text), which will improve performance? – Vagif Abilov Sep 19 '11 at 11:08
  • Answering my own question: yes to both of them. – Vagif Abilov Sep 19 '11 at 11:19
  • @Vagif - If you create an index on `UPPER(some_column)` and search on `SOME_COLUMN` without the function call, the function-based index will not be used and you'll be back to the performance of a full table scan. I'll wager that your test was using the old index for the search and the new function-based index to enforce uniqueness. You almost certainly don't want to have both a regular and a function-based index on the same column. – Justin Cave Sep 19 '11 at 13:57
  • You're right Justin, I still need to use an expression UPPER(column)=UPPER(value) in order to avoid table scan, but at least if I have indexes on both the value and uppercased value performance won't suffer. – Vagif Abilov Sep 19 '11 at 14:29
0

I would say create "clean" fields based on your company's business logic for cleaning these fields (a company name or address, for example, will have a surprising amount of cleaning logic around furniture words, usps rules, etc., not to mention third party cleaning routines if used).

So, for important search fields, keep BOTH the raw (unclean) and clean versions. If your cleaning logic changes significantly over time, you can go back and reclean based on raw values. Your searches (assuming you're not using a fuzzy logic engine like Oracle Text or Lucene) would hit the clean values.

For all other fields (that do not merit separate clean versions), I usually perform a minimum level of scrubbing. Capitalization, trimming, strip control chars, reducing multiple spaces to 1 space, etc. is all part of a set of basic cleansing routines. These are usually done before the data is loaded (in the data build programs).

So, come up with a set of data standards and stick to it, but I wouldn't just throw any ole junk in the db. Try not to think of the db as a dumping ground of data, and you'll get better results for sure.

tbone
  • 15,107
  • 3
  • 33
  • 40
  • While I agree that every application should take care about proper data formatting, I also think that the database could be better in supporting certain widely used operations, like case-insensitive search. The database I am dealing with belong to a large enterprise, where control over data management is not in one hand. So whatever can be configured at a database level (keys, constraints etc.) should be done there. And I find keeping two versions of text fields just because of search indicates lack of built-in db support for such functionality. – Vagif Abilov Sep 19 '11 at 11:52
  • If this is achievable, it's great. My experience from various projects is that in large organizations where teams are spread in different locations and use different languages and platforma it's difficult to enforce standards with high focus on discipline. If there are 10 departments that have direct access to a database, then sooner or later somebody will disobey the rules unless they are implemented in forms of constraints. So the closer the constraints are placed to the database, the better it is for data consistency. – Vagif Abilov Sep 20 '11 at 06:19
  • Constraints at the db level is great, but can't alone solve a company's data cleansing strategy. If you can't even rely on your data to conform to basic cleansing rules (like uppercase), then what about control chars, or other concerns related to data like validating urls or emails or phone numbers? Your company should have a strategy to cleanse and/or format data in a consistent and expected way is my point (be it in a db package or external service). Otherwise, you risk turning your db into a big dumb dumping ground of "stuff" that users (apps, etc) must deal with later. – tbone Sep 20 '11 at 11:15