0

My join is producing results that I don't understand.

If it's important or relevant, I'm using Parquet Tables in Impala.

What I'm doing is:

create table test1(foo string, bar int) stored as parquet;
create table test2(foo string, bar int) stored as parquet;
insert into test1 values ("something1",1);
insert into test2 values ("something2",2);

Checking to make sure that works:

 select * from test1;

Gives me the output:

+----------------------+
| foo      | bar |
+----------------------+
| something1 | 1     |
+----------------------+
1 rows

And

 select * from test2;

Gives me the output:

+----------------------+
| foo      | bar |
+----------------------+
| something2 | 2     |
+----------------------+
1 rows

Everything seems fine so far. But now when I try to join these 2 tables with

 select * from test1 left outer join test2 using (foo);

I get:

+---------------------------------------------+
| foo      | bar | foo      | bar |
+---------------------------------------------+
| something1 | 1     | something2 | 2     |
+---------------------------------------------+
1 rows

That's unexpected. I expected the output to be something1, 1, null, null. Shouldn't the join only happen when test1.foo = test2.foo?

I also tried doing this with syntax join on test1.foo = test2.foo as well as with an inner join and saw the same results.

Can someone please explain to me what's going on here? I reread the documentation and I don't understand why this is happening.

tehc0w
  • 11
  • 2
  • You are doing an left outer join, you want an inner join surely? You are also using a column that doesn't match... Hence the result set when using an outer join – Charleh Jul 22 '17 at 21:34
  • I want to do a left outer join. This is a boiled down version of something I'm observing in a much larger data set. Trying to replicate the issue with simplest and smallest data set of 2 tables with 1 row each. And as mentioned, if I use inner join, the output is the same – tehc0w Jul 22 '17 at 21:44
  • I think you have over-simplified the problem. As stated, this is ludicrous behavior and would be indicative of a bug in the database. – Gordon Linoff Jul 22 '17 at 22:42
  • This is literally the input and output. Unless oversimplification could cause this ludicrous behavior, I was also thinking something is fundamentally wrong. For what it's worth, in the larger data set, this mismatch of join only happens once every few million rows – tehc0w Jul 23 '17 at 05:44
  • `select * from test1 left outer join test2 using (foo)` should return columns `foo, bar, bar`; you should "expect" `(something1,1,null)`. Are you sure you are catching all warnings & errors? – philipxy Jul 24 '17 at 03:41

2 Answers2

0

Try this

select * from test1 left outer join test2 on test1.foo=test2.foo
Eid Morsy
  • 966
  • 6
  • 8
  • As mentioned in my second to last paragraph, I tried that with the same output. I thought it could have overlooked a use case of using vs on. – tehc0w Jul 23 '17 at 05:42
  • that's strange , but try to upgrade the version of Impala or use a different version – Eid Morsy Jul 23 '17 at 07:09
0

Update: it appears to be an engineering bug. Thanks for the help and feedback, all

tehc0w
  • 11
  • 2