I'm new to databases, and I'm trying to figure out a good solution for dealing with large datasets. I mostly do statistical analyses using R, so I don't need a database as the back end for web pages or anything like that. My datasets are generally static - they are just big.
I was trying to do a simple left join of a ~10,000,000 record table onto a ~1,400,000 record table. The 1.4 m table had unique records. After churning for 3 hours, it quit on me. The query was specified correctly - when I ran it limited to the first 1,000 records, it returned exactly what I expected. Eventually I found a way to split the job into 10 queries and it ran, but by that time I was able to do the same merge in R pretty quickly, without all the fancy calls to sqlite and indexing.
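For reference, the query was roughly of this form, run through RSQLite from R (the table and column names below are placeholders, not my real ones), along with the plain R merge that ended up being faster for me:

library(DBI)
library(RSQLite)

con <- dbConnect(SQLite(), "mydata.sqlite")

## Left join the ~10 m row table ("big") onto the ~1.4 m row table ("lookup"),
## which has one unique row per id
res <- dbGetQuery(con, "
  SELECT b.*, l.extra_col
  FROM big AS b
  LEFT JOIN lookup AS l ON b.id = l.id
")

## Pull both tables into data frames and do the same left join in plain R;
## all.x = TRUE keeps every row of the big table, like LEFT JOIN does
big_df    <- dbReadTable(con, "big")
lookup_df <- dbReadTable(con, "lookup")
merged    <- merge(big_df, lookup_df, by = "id", all.x = TRUE)

dbDisconnect(con)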
I've been looking at databases because I thought they were faster/more effective for these basic data manipulations, but maybe I'm just overlooking something. In the example above, I had indexed the appropriate columns, and I'm surprised that sqlite could not handle it while R could.
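In case it matters, the indexes were created along these lines (again with placeholder names), so I don't think a missing index is the explanation:

library(DBI)
library(RSQLite)

con <- dbConnect(SQLite(), "mydata.sqlite")

## Index the join column on both tables before running the join
dbExecute(con, "CREATE INDEX IF NOT EXISTS idx_big_id    ON big(id)")
dbExecute(con, "CREATE INDEX IF NOT EXISTS idx_lookup_id ON lookup(id)")

dbDisconnect(con)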
Sorry if this question is a little foggy (I'm a little foggy on databases myself), but if anyone has advice on something obvious I'm doing wrong that keeps me from taking advantage of the power of sqlite, that would be great. Or am I just expecting too much of it, and is a 10 m X 1.4 m record merge simply too big to execute without breaking it up?
I would have thought that a database could outperform R in this respect?
thanks!
EXL