0

I have a large file (data.txt, 35 GB) which has 3 columns. Some example part of the file would look like the following:

...   ...    ...
5   701565  8679.56
8   1.16201e+006    3193.18
1   1.16173e+006    4457.85
14  1.16173e+006    4457.85
9   1.77942e+006    7208.73
4   1.78011e+006    8239.88
14  1.78019e+006    8195.57
9   2.00206e+006    8858.55
4   2.00199e+006    7924
...   ...    ...

I want to plot a histogram for the 3rd column when the values in the second column are between 0 and 50'000.

Then I want to do another histogram where the values of the first column are between 50'000 and 100'000. And so on, and so forth.

I don't know how to load/read only the data I need at a time. Any help would be appreciated!

If I should use the sqldf package then the question I have would be how I can say that the value of the 2nd column should be smaller than a e.g. 50'000?

The difference to How do i read only lines that fulfil a condition from a csv into R? is that I don't have any column names. Therefore I cannot do what they propose in their solution:

sql = "select * from file where Sepal.Length > 5"

Community
  • 1
  • 1
Fabi
  • 71
  • 1
  • 8
  • Maybe you can use the `sqldf` package for that. See [these search results](http://stackoverflow.com/search?q=%5Br%5D+filter+sqldf+where) for some inspiration. – Jaap Oct 28 '16 at 10:49
  • I don't have any column name. Therefore I don't know how to use sqldf to subset? – Fabi Oct 28 '16 at 10:56
  • Have you considered just naming the columns in the csv file? The structure of a csv is pretty simple and you can probably just use Notepad++... – JBGruber Oct 28 '16 at 12:43
  • For example, just add `"","column 1","column 2"` as first line – JBGruber Oct 28 '16 at 12:44
  • @JonGrub I'm not sure how to do that. I cannot just open the file since it is 42 GB. It's too large. – Fabi Oct 28 '16 at 13:54
  • 1
    You can try: 1. create new txt file (called e.g aaa so it is on top in a folder) 2. Open and place the first line from above 3. rename file from txt to extension csv 4. put both files in one folder 5. merge files via command (windows e.g. [link](http://www.tomnash.eu/how-to-combine-multiple-csv-files-into-one-using-cmd/)). This creates a new csv file with proper columns. Bit hacky but I used it before in a similar problem. – JBGruber Oct 28 '16 at 14:04
  • 1
    @Fabi; look at Dirk's answer in the linked duplicate - it does not use column headers – user20650 Oct 29 '16 at 16:32

1 Answers1

1

I think recent versions of readr support this sort of thing. The following is just adapted from the help for readr::read_csv_chunked

library(readr)

f <- function(x, pos) subset(x, X3 > 0 & X3 < 50000)
df <- read_csv_chunked(
  'test.csv', 
  DataFrameCallback$new(f),
  chunk_size = 100000,
  col_names = F
)
Eric
  • 3,403
  • 1
  • 19
  • 18