
I'm new to databases, and I'm trying to figure out a good solution for dealing with large datasets. I mostly do statistical analyses using R, so I don't need a database as the backend for web pages or anything. My datasets are generally static - they are just big.

I was trying to do a simple left join of a ~10,000,000 record table on a ~1,400,000 record table. The 1.4 m table had unique records. After churning for 3 hours, it quit on me. The query was specified correctly - I ran it limited to the first 1,000 records and it returned exactly what I expected. Eventually I found a way to split this up into 10 queries and it ran, but by this time I could do that merge in R pretty quickly, without all the fancy calls to SQLite and indexing.

I've been looking to use databases because I thought they were faster/more effective for these basic data manipulations, but maybe I'm just overlooking something. In the above example, I had indexed the appropriate columns, and I'm surprised that SQLite could not handle it whilst R could.
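
For reference, here is the general shape of the kind of join I mean, written using the RSQLite package (the table and column names below are made up for illustration; my actual setup differs):

    ## Illustrative sketch only -- "big_table", "small_table" and "id" are
    ## invented names standing in for my real tables and join column.
    library(RSQLite)

    con <- dbConnect(SQLite(), dbname = "mydata.sqlite")

    ## indexes on the join columns of both tables
    dbExecute(con, "CREATE INDEX IF NOT EXISTS idx_big_id ON big_table (id)")
    dbExecute(con, "CREATE INDEX IF NOT EXISTS idx_small_id ON small_table (id)")

    ## the left join that churned for hours
    res <- dbGetQuery(con, "
        SELECT b.*, s.extra_col
        FROM big_table AS b
        LEFT JOIN small_table AS s ON b.id = s.id")

    dbDisconnect(con)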

Sorry if this question is a little foggy (I'm a little foggy on databases), but if anyone has any advice on something obvious I'm doing wrong that keeps me from taking advantage of the power of SQLite, that would be great. Or am I just expecting too much of it, and a 10 m X 1.4 m record merge is simply too big to execute without breaking it up?

I would think that a database could outperform R in this respect?

thanks!

EXL

  • I think there's something wrong with your query (in terms of efficiency). If you post it, someone should be able to help. – Richard Herron Nov 28 '10 at 23:34
  • Are you indexing your tables? E.g. in the table schema, have something like "CREATE INDEX mapping_ind ON mapping (id);" - I found that creating indexes on the fields I wanted to join on sped up sqlite3 joins massively. – Aaron Statham Nov 29 '10 at 12:18
  • As has been pointed out, it will generally come down to your table/index setup in relation to your query. – Jay Jan 27 '11 at 18:54

2 Answers


I am going through the same process. If you look through the questions I've asked recently, you may get some good pointers, or at least avoid a lot of the time I've wasted :). In short, here's what's been most helpful to me.

-- the RSQLite package

-- the RSQLite.extfuns package

-- the SQLite FAQ

I'm still a newbie, but in general, you should be using SQLite for subsetting data that is too large to bring into RAM. I would think that if the data are small enough to handle in RAM, then you're better off using the native R tools for joins/subsets. If you find you're more comfortable with SQL queries, then there is the sqldf package. Also, JD Long has a great discussion on using sqldf with large datasets.
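
To make that concrete, here is a rough sketch of the two approaches (all of the data, file, and table names below are invented toy examples):

    ## Minimal sketch; the data frames and table names are toy examples.
    library(RSQLite)
    library(sqldf)

    ## Toy data standing in for a big table and a small lookup table
    big    <- data.frame(id = 1:10, year = rep(2008:2009, 5), x = rnorm(10))
    lookup <- data.frame(id = c(2, 4, 6), label = c("a", "b", "c"))

    ## RSQLite: keep the data in an SQLite database (an in-memory one here)
    ## and pull back only the subset you actually need in RAM
    con <- dbConnect(SQLite(), dbname = ":memory:")
    dbWriteTable(con, "big", big)
    subset_df <- dbGetQuery(con, "SELECT * FROM big WHERE year = 2009")
    dbDisconnect(con)

    ## sqldf: write the same kind of SQL directly against R data frames
    merged <- sqldf("SELECT b.*, l.label
                     FROM big b LEFT JOIN lookup l ON b.id = l.id")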

Richard Herron
  • If you are giving advice to R newbies it's a good idea to use proper capitalization, so I suggest you edit your answer to spell the package "sqldf". There are also a significant number of gurus that get their hackles up about referring to packages as libraries. The library is where the packages reside. (I'm not one of the sensitive types on this point, but hoped it was helpful to warn you.) – IRTFM Nov 29 '10 at 04:32
  • @JD -- NP. Credit where credit is due. That post really helped me. – Richard Herron Nov 30 '10 at 02:35

I have to admit that I'm surprised this has been a problem for you. SQLite has always worked well for me, at least speed-wise. However, SQLite is easy because it is so flexible, and it can be dangerous for the same reason. SQLite tends to be very forgiving with data types. Sometimes this is an absolute godsend, when I don't want to spend a bunch of time tweaking things to perfection, but with great flexibility comes great responsibility.

I have noticed that I need to be careful moving data into SQLite. Text is easy. However, sometimes numbers get stored as text rather than numbers. Doing a JOIN on a column of numbers is faster than the same JOIN on a column of text. If your number columns are stored as text and then coerced into numbers for the comparison, you would lose most of the advantage of using an index.
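
One quick way to check for this from R is SQLite's typeof() function; something along these lines (the database, table, and column names are only placeholders):

    ## Placeholder names -- substitute your own database, table and join column.
    library(RSQLite)
    con <- dbConnect(SQLite(), dbname = "mydata.sqlite")

    ## How are the values in the join column actually being stored?
    dbGetQuery(con, "SELECT typeof(id) AS stored_as, COUNT(*) AS n
                     FROM big_table
                     GROUP BY typeof(id)")

    ## If this reports 'text', the cleanest fix is usually to make the column
    ## numeric in R (e.g. as.integer()) *before* dbWriteTable(), so the table
    ## is created with a numeric column type in the first place.
    dbDisconnect(con)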

I don't know how you got your data into SQLite, so the first thing I would do is look at your table schemas and make sure they make sense. And while they may seem obvious, indexes can be tricky. Taking a look at the queries might also turn up something useful.
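
If it helps, both the schema and the query plan can be inspected from R, roughly like this (again, the names are placeholders):

    ## Placeholder names -- adapt to your own tables and query.
    library(RSQLite)
    con <- dbConnect(SQLite(), dbname = "mydata.sqlite")

    ## Table definitions and any existing indexes
    dbGetQuery(con, "SELECT type, name, sql FROM sqlite_master")

    ## Will SQLite actually use an index for the join?
    dbGetQuery(con, "EXPLAIN QUERY PLAN
                     SELECT b.*, s.val
                     FROM big_table AS b
                     LEFT JOIN small_table AS s ON b.id = s.id")
    ## Look for 'SEARCH ... USING INDEX' rather than 'SCAN' on the table that
    ## is probed for every row.
    dbDisconnect(con)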

Without being able to see the underlying structure and queries, answers to this question will be educated guesses.

Choens
  • Will try to get my structure and queries up soon, thanks. I heard that doing a join on numbers is faster than text. Is this true even when they are indexed? I eventually converted the columns I was joining on into numerics, but I thought that kind of encoding was handled by creating an index? – exl Dec 05 '10 at 04:12
  • As they say -- It depends. An index on a column with text in it can be very effective, if the text is a factor. For example - 'M' or 'F'. An index on a string that is also a primary key, or is fairly unique to a large number of rows is effectively a waste of time and storage space. However, an index on numbers is useful, because computers can easily compare size, allowing the index to be searched in a more efficient manner. In fact - there are many efficient ways to search for numbers that don't work with text. If your text string is fairly unique per row -- your index was wasted effort. – Choens Dec 05 '10 at 22:46