I was exposed to the world of tables and data structures in R before I encountered RDBMSs and other database systems. It is quite elegant in R/Python to create tables and lists from structured data (.csv or other formats) and then manipulate the data programmatically.
Last year, I attended a course in database management and learnt about structured and unstructured databases. I also noticed that the norm is to feed data from multiple sources into a database rather than using the sources directly in R (for convenience and discipline?).
For research purposes, R alone seems to suffice for joining, appending, and even complicated data manipulations.
The question that keeps arising is: when should I use R directly, with commands such as read.csv, and when should I first create a database and query its tables through an R-SQL interface?
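To make the contrast concrete, here is a minimal sketch of the two workflows I mean, using DBI/RSQLite for the database side; the file name, table name, and column are hypothetical:

```r
library(DBI)
library(RSQLite)

# Approach 1: read a flat file directly into R and work on the data frame
persons <- read.csv("persons.csv")  # hypothetical file

# Approach 2: stage the same data in a database and pull subsets via SQL
con <- dbConnect(RSQLite::SQLite(), "study.sqlite")   # hypothetical database file
dbWriteTable(con, "persons", persons, overwrite = TRUE)
adults <- dbGetQuery(con, "SELECT * FROM persons WHERE age >= 18")
dbDisconnect(con)
```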
For instance, suppose I have multi-source data such as: (a) person-level information (age, gender, smoking habits), (b) outcome variables (such as surveys taken by the participants in real time), (c) covariate information (environment characteristics), (d) treatment input (occurrence of an event that modifies the outcome, i.e. the survey response), and (e) time and space information about participants taking the survey.
How should one approach data collection and processing in this case? There may be standard industry procedures, but I put this question forward here to understand the feasible and optimal approaches that individuals and small groups of researchers can adopt. A plain-R sketch of what I have in mind follows.
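For concreteness, this is roughly how I would combine the sources above in base R alone; all file names and columns (person_id, location_id, timestamp, event_time) are assumptions for illustration:

```r
# Hypothetical flat files corresponding to sources (a)-(e)
persons    <- read.csv("persons.csv")     # (a) person-level info, keyed by person_id
surveys    <- read.csv("surveys.csv")     # (b) outcomes, with person_id + timestamp
covariates <- read.csv("covariates.csv")  # (c) environment data, keyed by location_id
events     <- read.csv("events.csv")      # (d) treatment events, person_id + event_time

# Build one analysis table by joining on the shared keys
analysis <- merge(surveys, persons, by = "person_id")
analysis <- merge(analysis, covariates, by = "location_id")
analysis <- merge(analysis, events, by = "person_id", all.x = TRUE)  # left join

# Flag survey responses recorded after that person's treatment event
analysis$post_treatment <- !is.na(analysis$event_time) &
  analysis$timestamp >= analysis$event_time
```

My question is whether this kind of in-memory merging scales and stays reproducible, or whether the same joins are better pushed into a database first.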