
I have a large amount of data: about 20 million rows and 6 columns. I am trying to extract data from this large .csv file. I tried R, but I get an error message; I am using a MacBook with 4 GB RAM and an i5 processor. Is there any way I can extract the information? I tried Excel, but it can only take 1 million rows. Any advice or help will be useful.

The file is more than 1.3 GB. I want to divide this data set into groups of about 2000-3000 rows based on a parameter. I tried R, and when I used read.csv it ran for a moment, but after 10 minutes or so I got "R not responding".

I want to separate the data based on the 3rd column.

`SHA PCT PRACTICE BNF CODE BNF NAME`

  • Have a look at http://stackoverflow.com/questions/3094866/trimming-a-huge-3-5-gb-csv-file-to-read-into-r. In addition, your question currently lacks the required detail. Please help us help you by providing a reproducible example (i.e. code and example data); see http://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example for details. – Paul Hiemstra May 13 '13 at 09:39
  • How big is the file and what do you want to do with the information from it? – Kuchi May 13 '13 at 09:39
  • What do you need to do? Extract some rows according to some criteria, or process each row of the csv and then calculate some results? – nakosspy May 13 '13 at 09:39
  • 20 million x 6 columns x 8 bytes for double ~ 1GB, so it should fit easily. What have you tried? – themel May 13 '13 at 09:40
  • @themel as a rule of thumb, R needs 3 times the amount of memory that is needed for the object because of copying and such, so 1GB is pushing it a bit, depending on what the OP is doing. – Paul Hiemstra May 13 '13 at 09:42
  • The file is more than 1.3 GB. I want to divide this data set into groups of about 2000-3000 rows based on a parameter. I tried R, and when I used read.csv it ran for a moment, but after 10 minutes or so I got "R not responding". – Manish Jain May 13 '13 at 09:45
  • post a reproducible example – Nishanth May 13 '13 at 09:50
  • Can you post the first 5 rows of the file? Do you want to subset based on the 3rd column? – Nishanth May 13 '13 at 10:36
  • This can probably be done with a few lines of `awk`. `awk -F, '$3 == "foo"' < data.csv > foo.csv` will select all lines whose third field equals "foo". Learn some new tools. – Spacedman May 16 '13 at 06:31

2 Answers


First of all, you have to tell us what you mean by "extract data". If it is some sort of aggregation, or if the work can be divided, then I think the easiest way is to split your huge csv file into many small ones.
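For example, here is a minimal sketch of that splitting approach, assuming the file is called `data.csv`, the grouping key is the 3rd column, and its values are safe to use as file names (all of these are assumptions). It reads 500,000 rows at a time, so the whole file never has to sit in memory:

```r
## Sketch only: "data.csv", the chunk size and the output file names are
## assumptions; the grouping key is taken to be the 3rd column.
chunk_size <- 500000
con <- file("data.csv", "r")
header <- strsplit(readLines(con, n = 1), ",")[[1]]   # read the header line once

repeat {
  chunk <- tryCatch(
    read.csv(con, header = FALSE, col.names = header,
             nrows = chunk_size, stringsAsFactors = FALSE),
    error = function(e) NULL)                          # no rows left to read
  if (is.null(chunk) || nrow(chunk) == 0) break

  ## Append each group's rows to its own file, named after the key value.
  for (grp in split(chunk, chunk[[3]])) {
    out <- paste0(grp[1, 3], ".csv")
    new_file <- !file.exists(out)
    write.table(grp, out, sep = ",", row.names = FALSE,
                col.names = new_file, append = !new_file)
  }
  if (nrow(chunk) < chunk_size) break                  # last (partial) chunk
}
close(con)
```

With only 4 GB of RAM, a smaller chunk size (say 100,000 rows) is a safer starting point; the output files can then be read back one at a time.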

If you need something else, then have a look here:

Salvador Dali

I would dump it into a SQL database (MySQL, PostgreSQL, SQLite) and make a call using the ODBC driver that you can find in the RODBC package (JDBC also works).

You can then do a `SELECT * FROM your_table WHERE column_3 = X;`
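As a rough sketch of that last step, assuming the CSV has already been imported into the database (with the database's own import tool) and is reachable through an ODBC data source; the DSN, table and column names below are placeholders:

```r
## Sketch only: "my_dsn", "your_table" and "column_3" are placeholders.
library(RODBC)

ch <- odbcConnect("my_dsn")                 # connect via the ODBC data source
subset_df <- sqlQuery(ch,
  "SELECT * FROM your_table WHERE column_3 = 'X';")
odbcClose(ch)

## subset_df is now an ordinary data.frame containing only the matching rows.
```

This way only the rows you ask for ever come into R, so the 4 GB of RAM is no longer the bottleneck.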

Good luck!

Link to tutorial

Guillaume