4

After having endlessly tried to handle large (3-35gb) csv files in R, I have moved over to SQL for handling these datasets. So I am using this code within an R environment (using the using the SQlite based RSQLite package) but it should not detract from my SQL question!

My Question: How do I make a selection of one table based on matching values given in another table?

I would Like to explain by example. I have the following table format:

"Data" Table

Symbol| Value| EX
A  | 1  | N       
A  | 1  | N     
A  | 2  | T  
A  | 3  | N  
A  | 4  | N  
A  | 5  | N  
B  | 1  | P       
B  | 2  | P  
B  | 2  | N  
B  | 2  | N  
B  | 3  | P  
B  | 5  | P  
B  | 6  | T  
... 

I want to select all entries for which the symbol and exchange value matches according to a certain condition given in the example table below.

"Symbolexchange" Table:

Ticker| Exchange
A  | N       
B  | P  
... 

(Note that symbol and ticker refer to the same attribute, also EX and Exchange refer to the same attribute)

So the output I am aiming for is so that it only keeps A entries given exchange is N etc:

Symbol| Value| EX
A  | 1  | N       
A  | 1  | N     
A  | 3  | N  
A  | 4  | N  
A  | 5  | N  
B  | 1  | P       
B  | 2  | P  
B  | 3  | P  
B  | 5  | P  
... 

I was able to do this via two methods, although I am not quite satisfied with them.

This method adds the reference table in columns next to the original table, which is redundant.

SELECT *
FROM Data
INNER JOIN Symbolexchange 
ON Data.EX=Symbolexchange.EXCHANGE
AND Data.SYMBOL=Symbolexchange.TICKER

This Method also gets the job done directly but is slower than the above.

SELECT *
FROM Data
WHERE EX=(SELECT exchange FROM Symbolexchange WHERE ticker = SYMBOL)

Is there a better and faster way way to program this? Speed is quite important because of the size of my data sets. Any other comments on my code welcome!

Thanks

Hugstime
  • 131
  • 1
  • 9
  • The second sql statement will match only on `EX`. Is that you want? You do not need to match `Symbol` and `Ticker`? – unlimit Oct 30 '13 at 16:21
  • Yes you are right, I only want to match on `EX` based on the table Symbolexchange. So for ticker A, I only want to select rows for which Exchange is N. For ticker B, I only want to select rows for which Exchange is P. etc. – Hugstime Oct 30 '13 at 16:26
  • 2
    Two questions: 1) what DBMS is this? (performance advice varies a lot across different ones), and 2) Do you have any keys or indexes defined on either of these tables? – RBarryYoung Oct 30 '13 at 16:33
  • 1) SQLite 2) No (not yet? :)) – Hugstime Oct 30 '13 at 16:37
  • 1
    I think the first sql will be efficient. I cannot think of anything that will be more efficient right now. You could also improve performance by adding proper indexes. You can remove the redundant columns by not doing a `Select *` and doing `Select data.symbol, data.value, data.ex`. – unlimit Oct 30 '13 at 16:41

2 Answers2

3

Two things that you can do to improve performance:

First (and most importantly) add a key or index to your tables. I don't know SQLite, but usually there's a command something like this:

CREATE INDEX DataIX1 ON Data(Symbol,EX)

You'll want one on the other table too:

CREATE INDEX SymbolExchangeIX1 ON Symbolexchange(Ticker,Exchange)

You may need to throw in ".." or '..' on the names...

The second thing is that although your first query is probably your best approach, you should only return the columns that you actually need/want:

SELECT Data.*
FROM Data
INNER JOIN Symbolexchange 
ON Data.EX=Symbolexchange.EXCHANGE
AND Data.SYMBOL=Symbolexchange.TICKER
RBarryYoung
  • 55,398
  • 14
  • 96
  • 137
  • 1
    Thank you! I timed this in R and this sped up my code from 9.74 seconds down to 0.40 second for a subsection of my data. – Hugstime Oct 30 '13 at 17:32
-1

I am not sure if you are using mysql or MS SQL. For MS SQL, you make your query faster by adding a no lock to your queries.

1) WITH (NOLOCK)

Select * from user with (NOLOCK)

OR

2) SET TRANSACTION ISOLATION LEVEL READ UNCOMMITTED

SET TRANSACTION ISOLATION LEVEL READ UNCOMMITTED
select * from user a, class b where a.userid=b.userid

You can refer to previously discussed topic on this below. WITH (NOLOCK) vs SET TRANSACTION ISOLATION LEVEL READ UNCOMMITTED

Community
  • 1
  • 1
Dillon Tan
  • 19
  • 2