
I'm sorry if this is a stupid question, but I can't seem to get my head around it. I'm fairly new to SQL and this behavior would be strange in R or Pandas or other things that I'm used to using.

Basically, I have two tables in two different databases, with a common key user_id. I want to join all the columns with

SELECT * FROM db1.first_table t1 
JOIN db2.second_table t2 
ON t1.user_id = t2.user_id

Great, it works. Except there are two (identical) columns called user_id. This wouldn't really matter, except that I am doing this in pyspark, and when I try to export the joined table to a flat file I get an error that two of the columns have the same name. There are workarounds for this, but I'm just wondering if someone can explain why the join returns both user_id columns. It seems like it is an inner join, so by definition the columns are identical. Why would it return both?

As a side question, is there an easy way to avoid this behavior?

Thanks in advance!

seth127
  • Because "select *" is returning all columns from both tables. If you need to limit the output, then explicitly define the table.columns that you want – OldProgrammer Jul 25 '17 at 19:20
  • because you are using `SELECT *` and that means that it returns every column for every table in the join – Lamak Jul 25 '17 at 19:21
  • Workaround? Using `select *` is actually not very good practice and you should ALWAYS define which columns you want to retrieve. That is not a workaround - that is the way you should do it! – juergen d Jul 25 '17 at 19:27
  • My only issue is that both tables have literally over 1000 columns. I can list them all, but it will be a bit tedious, not to mention ugly. Can you do SELECT * EXCEPT user_id? – seth127 Jul 26 '17 at 00:42

6 Answers


SELECT * returns all columns from all tables of the query. That includes both user_id columns - one from table A, one from table B.

The best practice is to explicitly list the column names you want returned, though another option to shorten the list would be:

SELECT TableA.*, 
       TableB.col1, 
       TableB.col2, 
       ...rest of B columns except user_id
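
If even that is impractical (the comments mention tables with 1000+ columns), the select list can be generated programmatically. A minimal PySpark sketch, assuming a SparkSession named spark and the tables from the question:

# Build the "all of t2 except user_id" select list instead of typing it out.
t2_cols = [c for c in spark.table("db2.second_table").columns if c != "user_id"]
select_list = "t1.*, " + ", ".join("t2." + c for c in t2_cols)

df = spark.sql(
    "SELECT " + select_list + " "
    "FROM db1.first_table t1 "
    "JOIN db2.second_table t2 ON t1.user_id = t2.user_id"
)
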
Aaron Dietz
  • This cannot possibly be the best way to do this? In data science I often have, like the person who asked this question, 1000+ columns. Good luck listing all of these manually... – Thomas Sep 28 '18 at 08:26
  • @Thomas In SQL it's the only way. There are plenty of ways to generate a column select without typing them all manually, but the question asked if, "in SQL", there is a way around this behavior, and there is not. – Aaron Dietz Oct 01 '18 at 20:59
  • What about [this answer](https://stackoverflow.com/a/41193410)? `USING` instead of `ON` seems to solve the problem? (Honest question, I am not deep enough into SQL to know any potential pitfalls) – Thomas Oct 02 '18 at 11:02
  • See also [this excellent and highly upvoted answer](https://stackoverflow.com/a/11367066/4629950) – Thomas Oct 02 '18 at 11:04
  • @Thomas Those are good points, it's just that `USING` isn't available in all flavors of SQL. I think your answer is the correct one given the `pyspark` tag. – Aaron Dietz Oct 02 '18 at 16:24

All of these answers (except the one that OP wrote himself) seem to assume that we are operating on really small tables where we can manually type out every column we need.

The simplest solution in PySpark would be to use the DataFrame join syntax:

df = left_df.join(right_df, ["name"])

This will not duplicate the column and behaves like a pandas merge. If there is no special reason why you have to write it as an SQL command, I would recommend this. Contrast it with

df = left_df.join(right_df, left_df.name == right_df.name)

which will behave like a SQL join and keep both columns!

This also applies to Scala and R, see here.

Another solution would be to rename the join column in the second DataFrame to something like "target_dataframe2", join with SQL, and then simply drop "target_dataframe2" again.
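
A minimal sketch of that rename-then-drop workaround, reusing the illustrative left_df/right_df names from above:

# Rename the duplicate column, join on the renamed copy, then drop it.
right_renamed = right_df.withColumnRenamed("name", "name_2")
df = left_df.join(right_renamed, left_df.name == right_renamed.name_2)
df = df.drop("name_2")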

Thomas
  • Thank you, this is helpful. Does the spark command return both user_id cols or does it operate more like a pandas merge (i.e. only keep one copy of the join keys)? – seth127 Sep 29 '18 at 12:32
  • It depends on how you use it: If you use it as described above (just the column name), it behaves like pandas. If you mention both columns like this: `left_df.name == right_df.name`, it will give you both columns. – Thomas Oct 01 '18 at 07:49

This is because you are using SELECT *. When only * appears after SELECT, it returns all the columns from both tables. You have to define column names; always define which columns you want to display. You can do something like this:

SELECT t1.user_id, t1.ColumnName1, t2.ColumnName2
FROM db1.first_table t1 
INNER JOIN db2.second_table t2 ON t1.user_id = t2.user_id

* can be used in the following ways:

The following query will return all columns from both tables:

SELECT *
FROM db1.first_table t1 
INNER JOIN db2.second_table t2 ON t1.user_id = t2.user_id

The following query will return all columns from first_table only:

SELECT t1.*
FROM db1.first_table t1 
INNER JOIN db2.second_table t2 ON t1.user_id = t2.user_id

The following query will return all columns from second_table only:

SELECT t2.*
FROM db1.first_table t1 
INNER JOIN db2.second_table t2 ON t1.user_id = t2.user_id

You can also get all columns from one table and only certain columns from the other, in this manner:

SELECT t1.*, t2.ColumnName
FROM db1.first_table t1 
INNER JOIN db2.second_table t2 ON t1.user_id = t2.user_id
Kashif Qureshi

You can reduce the number of columns returned by referencing only the fields you want.

Right now you have

  SELECT *

which is equivalent to

  SELECT t1.*, t2.*

Maybe you want something like

  SELECT t1.*, t2.field1, t2.field2 ...
Juan Carlos Oropeza

Ok, I figured out a way to do this without typing out all of the column names (as I mentioned in the comments, there are ~5k columns total).

This is specific to pyspark, but I just exported the column names to a CSV, loaded them in, and did the following:

# Each CSV holds one column name per line. Neither list should include
# user_id, or the join would duplicate it again; also filter out blank
# entries left by trailing newlines.
with open("t1_cols.csv") as data_file:
    t1_cols = [c for c in data_file.read().split('\n') if c]
with open("t2_cols.csv") as data_file:
    t2_cols = [c for c in data_file.read().split('\n') if c]

sql = 'SELECT t1.user_id, t1.' + ', t1.'.join(t1_cols) + \
      ', t2.' + ', t2.'.join(t2_cols) + ' ' + \
      'FROM db1.first_table t1 JOIN db2.second_table t2 ON t1.user_id = t2.user_id'

df = sqlContext.sql(sql)

A little obnoxious, but it worked.

Also, I accepted the first answer because all of the above answers are technically correct and that was the first one. Thanks for the help!

seth127

If you want only one user_id column in the result, use an inner join with the USING keyword.

When you use the USING keyword with a column name, the common column is collapsed and displayed only once in the output. When you use ON with the condition t1.user_id = t2.user_id, it is merely a coincidence that columns with the same name appear in the condition.

ON can also compare entirely different columns of the two tables, so it does not filter out columns based on the condition. So, if you want common columns displayed only once after joining, use the USING keyword.
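
Spark SQL also supports USING (it is not available in every SQL dialect), so a minimal sketch for the original query, assuming a SparkSession named spark, would be:

# With USING, the result contains a single user_id column.
df = spark.sql("""
    SELECT *
    FROM db1.first_table
    JOIN db2.second_table
    USING (user_id)
""")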

Yash Paneliya