1

I faced a problem with a data loss, caused by a wrong query.
Data restored, but now I'd like to understand the problem.

I encountered the problem on SQL Server 2014, but I replicated it on SQL Server 2000 and PostgreSQL. Specifically, there was a DELETE. In the following scenario I use a SELECT.

The tables creation for sql server 2014:

CREATE TABLE [dbo].[tmp_color](
    [color_id] [int] NOT NULL,
    [color_name] [nvarchar](50) NOT NULL,
    [color_cat] [int] NOT NULL,
 CONSTRAINT [PK_tmp_color] PRIMARY KEY CLUSTERED (
    [color_id] ASC
) WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, IGNORE_DUP_KEY = OFF
      , ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON) ON [PRIMARY]
) ON [PRIMARY]

CREATE TABLE [dbo].[tmp_color_cat](
    [catid] [int] NOT NULL,
    [catname] [nvarchar](50) NOT NULL,
 CONSTRAINT [PK_tmp_color_cat] PRIMARY KEY CLUSTERED (
    [catid] ASC
) WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, IGNORE_DUP_KEY = OFF
      , ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON) ON [PRIMARY]
) ON [PRIMARY]

And the Postgres version:

CREATE TABLE tmp_color (
  color_id integer NOT NULL,
  color_name text,
  color_cat integer,
  CONSTRAINT tmp_color_pkey PRIMARY KEY (color_id)
);

CREATE TABLE tmp_color_cat (
  catid integer NOT NULL,
  catname text,
  CONSTRAINT tmp_color_cat_pkey PRIMARY KEY (catid)
);

Data population (works on both RDBMS):

INSERT INTO tmp_color_cat (catid, catname) VALUES (1, 'magic color');
INSERT INTO tmp_color_cat (catid, catname) VALUES (2, 'normal color');

INSERT INTO tmp_color (color_id, color_name, color_cat) VALUES (1, 'red', 1);
INSERT INTO tmp_color (color_id, color_name, color_cat) VALUES (2, 'green', 2);
INSERT INTO tmp_color (color_id, color_name, color_cat) VALUES (3, 'black', 1);

The following SELECT is wrong:

SELECT color_cat
FROM tmp_color_cat;

Because color_cat does not exists in tmp_color_cat.
BUT, the moment you take this in a subquery:

SELECT * FROM tmp_color
WHERE color_cat IN(
    SELECT color_cat
    FROM tmp_color_cat
    WHERE catname = 'magic color'
);

It returns every single record from tmp_color.
The logical error in script is obvious: developer wrote the wrong column to identify category. If you are deleting records instead of selecting them, you will delete entire table. Not good.

Is this desired behavior? Or it is a consequence of subquery design?

By observing the execution plan of SQL Server, the logical operation is a Left Semi Join.

I found a couple of posts, one for PostgreSQL and one for SQL Server. Is there any good documentation I could send to the developer group explaining why this is not an error?

How can I avoid this kind of problems? My first idea is to use an alias. Aliases are good.

Erwin Brandstetter
  • 605,456
  • 145
  • 1,078
  • 1,228
user_0
  • 3,173
  • 20
  • 33
  • Your second link there seems to be exactly what you're describing so I don't really understand your question. I would advise always using aliases where more than one table is involved, even if it doesn't appear necessary it's good programming practice (as your incident illustrates) – komodosp Jul 07 '15 at 10:06
  • 3
    `color_cat` in the subselect references the column from the outer query and essentially changes the query to `where color_cat = colo_cat`. This is how the SQL standard defined the visibility rules (as explained in your first link). The best way (as you have already noted) is to *always* use aliases and fully qualified column names (`alias.column`) then this error becomes obvious. –  Jul 07 '15 at 10:12
  • 1
    Always. Use. Aliases. :) Even for query from single table - at some later time you may extend this query with another table and then you may have a problem again. – Arvo Jul 07 '15 at 10:16
  • Yes, @a_horse_with_no_name, the problem is clear. The problem in my environment was in SQL Server. Is the second link a good resource to send to developers? – user_0 Jul 07 '15 at 10:23
  • We can assume the current Postgres version 9.4? – Erwin Brandstetter Jul 07 '15 at 15:11
  • @ErwinBrandstetter, actually it is 9.3, but from my read it works the same way also on 8.x. – user_0 Jul 07 '15 at 15:22

4 Answers4

1

This is known behaviour with SQL Server. Using Aliases will prevent this

SELECT * FROM tmp_color
WHERE color_cat IN(
    SELECT A.color_cat
    FROM tmp_color_cat As A
    WHERE A.catname = 'magic color'
);

The above query will throw an error

Msg 207, Level 16, State 1, Line 3
Invalid column name 'color_cat'.
Raj
  • 10,653
  • 2
  • 45
  • 52
1

An alias in your case would have fixed the issue, as the way it's written it's simply referencing the table in the outer query tmp_color thus returning everything.

So you would rewrite to this, as you suggest:

SELECT * FROM tmp_color t1
WHERE t1.color_cat IN(
    SELECT t2.color_cat
    FROM tmp_color_cat t2
    WHERE t2.catname = 'magic color'
);

This would show that you have an error in your logic:

Invalid column name

Another safe way to write it would be to use JOIN. Note I've left off the aliases below as the JOIN specification doesn't have any columns that clash. If any column names were identical across tables, then you would get and Ambiguous column error. Best practice would be to always use aliases for clarity.

SELECT * 
FROM #tmp_color
INNER JOIN #tmp_color_cat ON color_cat = catid
WHERE catname = 'magic color'

And the equivelant DELETE would be:

DELETE t1
FROM #tmp_color t1
INNER JOIN #tmp_color_cat ON color_cat = catid
Where catname = 'magic color'

Full runnable sample:

CREATE TABLE #tmp_color
    (
      color_id INT ,
      color_name NVARCHAR(50) ,
      color_cat INT
    )

CREATE TABLE #tmp_color_cat
    (
      catid INT ,
      catname NVARCHAR(50) NOT NULL,
    )

INSERT INTO #tmp_color_cat (catid, catname) VALUES (1, 'magic color');
INSERT INTO #tmp_color_cat (catid, catname) VALUES (2, 'normal color');

INSERT INTO #tmp_color (color_id, color_name, color_cat) VALUES (1, 'red', 1);
INSERT INTO #tmp_color (color_id, color_name, color_cat) VALUES (2, 'green', 2);
INSERT INTO #tmp_color (color_id, color_name, color_cat) VALUES (3, 'black', 1);

DELETE t1
FROM #tmp_color t1
INNER JOIN #tmp_color_cat ON color_cat = catid
Where catname = 'magic color'

SELECT * 
FROM #tmp_color

DROP TABLE #tmp_color
DROP TABLE #tmp_color_cat

Produces the remaining rows:

color_id    color_name  color_cat
2           green       2
Tanner
  • 22,205
  • 9
  • 65
  • 83
1

Authoritative quote for Postgres

The scope of a subquery includes all visible columns of the outer query. Unqualified names are resolved to the inner query first, then extending the search outward.
Assign table aliases and use theses aliases to table-qualify column names to remove any ambiguity - as you hinted yourself already.

Here is an example in the Postgres manual with a definitive statement explaining the scope:

SELECT ... FROM fdt WHERE c1 IN (SELECT c3 FROM t2 WHERE c2 = fdt.c1 + 10)

[...]

Qualifying c1 as fdt.c1 is only necessary if c1 is also the name of a column in the derived input table of the subquery. But qualifying the column name adds clarity even when it is not needed. This example shows how the column naming scope of an outer query extends into its inner queries.

Bold emphasis mine.

There is also an example with an EXISTS semi-join in the list of examples in the same chapter of the manual. That's typically the superior alternative to WHERE x IN (subquery). But in this particular case you don't need either. See below.

One example:

DB design, naming convention

This disaster happened because of confusion about column names. A clear and consistent naming convention in your table definitions would go a long way to make that a lot less likely to happen. This is true for any RDBMS. Make them as long as necessary to be clear, but as short as possible otherwise. Whatever your policy, be consistent.

For Postgres I would suggest:

CREATE TABLE colorcat (
  colorcat_id integer NOT NULL PRIMARY KEY,
  colorcat    text UNIQUE NOT NULL
);

CREATE TABLE color (
  color_id    integer NOT NULL PRIMARY KEY,
  color       text NOT NULL,
  colorcat_id integer REFERENCES colorcat   -- assuming an FK
);
  • You already had legal, lower-case, unquoted identifiers. That's good.

  • Use a consistent policy. An inconsistent policy is worse than a bad policy. Not color_name (with underscore) vs. catname.

  • I rarely use 'name' in identifiers. It doesn't add information, just makes them longer. All identifiers are names. You chose cat_name, leaving away color, which actually carries information, and added name, which doesn't. If you have other "categories" in your DB, which is common, you'll have multiple cat_name which easily collide in bigger queries. I'd rather use colorcat (just like the table name).

  • Make the name indicate what's in the column. For the ID of a color category, colorcat_id is a good choice. id is not descriptive, colorcat would be misleading.

  • The FK column colorcat_id can have the same name as the referenced column. Both have exactly the same content. Also allows short syntax with USING in joins.

Related answer with more details:

Better query

Building on my supposed design:

SELECT c.*
FROM   colorcat cc
JOIN   color c USING (colorcat_id)
WHERE  cc.colorcat = 'magic color';

This is assuming a 1:n relationship between colorcat and color (which you did not specify, but seems likely).

Less publicly known (since the syntax is different in other RDBMS like SQL Server), you can join in additional tables in a DELETE as well:

DELETE FROM color c
USING  colorcat cc
WHERE  cc.colorcat = 'magic color'
AND    cc.colorcat_id = c.colorcat_id;
Erwin Brandstetter
  • 605,456
  • 145
  • 1,078
  • 1,228
0

Server tries to figure out is column name mentioned exists in any of tables/views/subqueries in a scope of your SQL statement.

In fact, it’s better to use aliases to avoid mistakes and misunderstandings like this:

  SELECT * FROM tmp_color tc
  WHERE color_cat IN(
      SELECT tcc.catid
      FROM tmp_color_cat tcc
      WHERE catname = 'magic color'
  );

So, if you try to use construction like that:

SELECT * FROM tmp_color tc
WHERE color_cat IN(
    SELECT tcc.color_cat
    FROM tmp_color_cat tcc
    WHERE catname = 'magic color'
);

You’ll have an error message:

Msg 207, Level 16, State 1, Line 3 Invalid column name 'color_cat'.

Stepan Novikov
  • 1,402
  • 12
  • 22