After having endlessly tried to handle large (3-35gb) csv files in R, I have moved over to SQL for handling these datasets. So I am using this code within an R environment (using the using the SQlite based RSQLite package) but it should not detract from my SQL question!
My Question: How do I make a selection of one table based on matching values given in another table?
I would Like to explain by example. I have the following table format:
"Data" Table
Symbol| Value| EX
A | 1 | N
A | 1 | N
A | 2 | T
A | 3 | N
A | 4 | N
A | 5 | N
B | 1 | P
B | 2 | P
B | 2 | N
B | 2 | N
B | 3 | P
B | 5 | P
B | 6 | T
...
I want to select all entries for which the symbol and exchange value matches according to a certain condition given in the example table below.
"Symbolexchange" Table:
Ticker| Exchange
A | N
B | P
...
(Note that symbol and ticker refer to the same attribute, also EX and Exchange refer to the same attribute)
So the output I am aiming for is so that it only keeps A entries given exchange is N etc:
Symbol| Value| EX
A | 1 | N
A | 1 | N
A | 3 | N
A | 4 | N
A | 5 | N
B | 1 | P
B | 2 | P
B | 3 | P
B | 5 | P
...
I was able to do this via two methods, although I am not quite satisfied with them.
This method adds the reference table in columns next to the original table, which is redundant.
SELECT *
FROM Data
INNER JOIN Symbolexchange
ON Data.EX=Symbolexchange.EXCHANGE
AND Data.SYMBOL=Symbolexchange.TICKER
This Method also gets the job done directly but is slower than the above.
SELECT *
FROM Data
WHERE EX=(SELECT exchange FROM Symbolexchange WHERE ticker = SYMBOL)
Is there a better and faster way way to program this? Speed is quite important because of the size of my data sets. Any other comments on my code welcome!
Thanks