4

I use R for most of my statistical analysis. However, cleaning and processing data, especially files of 1 GB or more, is quite cumbersome in R, so I use common UNIX tools for that. My question is: is it possible to run them interactively in the middle of an R session? An example: let's say file1 is the output dataset from an R process, with 100 rows. From this, my next R process needs file2, a specific subset of columns 1 and 2, which can easily be extracted with cut and awk. So the workflow is something like:

Some R process => file1
cut --fields=1,2 <file1 | awk something something >file2
Next R process using file2
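
To make the ask concrete, here is a rough R-side sketch of what I am hoping for (file names and the awk program are just placeholders):

## some R process => file1
write.table(result1, "file1", row.names = FALSE)

## <-- somewhere here: run `cut --fields=1,2 <file1 | awk ... >file2`
##     without leaving the R session

## next R process using file2
result2 <- read.table("file2")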

Apologies in advance if this is a foolish question.

user702432
    See `?system` for how to run shell commands from within R. – Joshua Ulrich Oct 25 '11 at 16:15
  • @Joshua: In my opinion posting this as an answer would be better practice. It would cause the display of [r] questions to have a non-zero answer and allow it to be accepted. – IRTFM Oct 25 '11 at 16:18
  • Maybe. I always feel a little guilty posting a super-short answer or one that I haven't explained in detail, so I leave it as a comment and let someone else (or the OP) re-post with more details as an answer ... – Ben Bolker Oct 25 '11 at 18:18

5 Answers

8

Try this (adding other read.table arguments if needed):

# 1
DF <- read.table(pipe("cut --fields=1,2 < data.txt | awk something_else"))

or in pure R:

# 2
DF <- read.table("data.txt")[1:2]

or, to avoid even reading the unwanted fields (assuming there are 4 fields):

# 3
DF <- read.table("data.txt", colClasses = c(NA, NA, "NULL", "NULL"))

The last line could be modified for the case where we know we want the first two fields but don't know how many other fields there are:

# 3a
n <- count.fields("data.txt")[1]
DF <- read.table("data.txt", header = TRUE, colClasses = c(NA, NA, rep("NULL", n-2)))

The sqldf package can also be used. In this example we assume a CSV file, data.csv, in which the desired fields are called a and b. If it is not a CSV file, pass the appropriate arguments to read.csv.sql to specify another separator, etc.:

# 4
library(sqldf)
DF <- read.csv.sql("data.csv", sql = "select a, b from file")
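
For example, a tab-separated file might be handled with something like the following (an untested sketch; the field names a and b and the file name are hypothetical, and sep is the separator argument documented for read.csv.sql):

# 4a
library(sqldf)
DF <- read.csv.sql("data.txt", sql = "select a, b from file", sep = "\t")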
G. Grothendieck
  • Great answers, all. I wish I could check both Grothendieck's and Dirk's responses as accepted. Many thanks. – user702432 Oct 25 '11 at 16:36
6

I think you may be looking for littler, which integrates R into Unix command-line pipelines.

Here is a simple example computing the file-size distribution of /bin:

edd@max:~/svn/littler/examples$ ls -l /bin/ | awk '{print $5}' | ./fsizes.r 
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
      4    5736   23580   61180   55820 1965000       1 

  The decimal point is 5 digit(s) to the right of the |

   0 | 00000000000000000000000000000000111111111111111111111111111122222222+36
   1 | 01111112233459
   2 | 3
   3 | 15
   4 | 
   5 | 
   6 | 
   7 | 
   8 | 
   9 | 5
  10 | 
  11 | 
  12 | 
  13 | 
  14 | 
  15 | 
  16 | 
  17 | 
  18 | 
  19 | 6

edd@max:~/svn/littler/examples$ 

and all it takes for that is three lines:

edd@max:~/svn/littler/examples$ cat fsizes.r 
#!/usr/bin/r -i

fsizes <- as.integer(readLines())
print(summary(fsizes))
stem(fsizes)
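
A similar, untested sketch for the cut/awk step in the question, modelled on fsizes.r (the script name cols.r is hypothetical):

#!/usr/bin/r -i

## read the whitespace-separated columns piped in from cut/awk
X <- read.table(text = readLines())
print(summary(X))

which could then be invoked as cut --fields=1,2 <file1 | awk 'something' | ./cols.r.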
Dirk Eddelbuettel
3

See ?system for how to run shell commands from within R.
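
For example, a minimal sketch of the workflow in the question (the cut/awk pipeline is the questioner's placeholder, not a tested command):

## run the shell step without leaving the R session
system("cut --fields=1,2 <file1 | awk 'something' >file2")

## then continue in R with the filtered file
DF <- read.table("file2")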

Joshua Ulrich
3

Staying in the tradition of literate programming, using e.g. org-mode and org-babel will do the job perfectly:

You can combine several different programming languages in one script and execute them separately or in sequence, export the results or the code, ...

It is a little bit like Sweave, except that the code blocks can be Python, bash, R, SQL, and numerous others. Check it out: org-mode and babel, and an example using different programming languages.

Apart from that, I think org-mode and babel are the perfect way of writing even pure R scripts.
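
For example, a minimal org file mixing a shell block and an R block might look roughly like this (the block headers are standard org-babel syntax; the commands and file names are placeholders):

#+begin_src sh
cut --fields=1,2 file1 | awk 'something' > file2
#+end_src

#+begin_src R
DF <- read.table("file2")
summary(DF)
#+end_src

Each block can be executed in place with C-c C-c, and the results or the code can be exported from the same file.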

Rainer
1

Preparing data before working with it in R is quite common. I have a lot of Unix and Perl pre-processing scripts, and have, at various times, maintained pre-processing scripts/programs for MySQL, MongoDB, Hadoop, C, etc.

However, you may get better mileage for portability if you do some kinds of pre-processing in R. You might try asking new questions focused on some of these particulars. For instance, to load large amounts of data into memory mapped files, I seem to evangelize bigmemory. Another example is found in the answers (especially JD Long's) to this question.
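
For instance, a file-backed load with bigmemory might look roughly like this (a sketch only; the file names and the all-numeric assumption are hypothetical):

library(bigmemory)

## read a large, purely numeric CSV into a file-backed big.matrix
X <- read.big.matrix("data.csv", header = TRUE, type = "double",
                     backingfile = "data.bin", descriptorfile = "data.desc")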

Iterator