
I have a huge dataset stored as .csv file, which does not fit into memory. However, I only need 3 columns from the data set which would fit into memory. How can I load them?

UPD: How can I choose columns by name rather than by their column indices? I do not know the indices a priori.

P.Escondido
  • You could take a look at the link: http://stackoverflow.com/questions/5788117/only-read-limited-number-of-columns-in-r where I have posted another solution. – Rahul Premraj Mar 15 '14 at 13:33

3 Answers


If your table is very large, consider using the data.table package:

# create an example: 10,000 rows by 100 columns
df <- data.frame(matrix(rnorm(1e6), ncol = 100))
write.csv(df, "sample.csv", row.names = FALSE)

library(data.table)
dt <- fread("sample.csv", select = c(3, 8, 20))
head(dt)
#            X3         X8         X20
# 1:  0.5537762  1.0271272 -0.14437400
# 2: -0.4111327 -0.2297311 -1.04998490
# 3: -1.2540440  0.6977565 -0.21514021
# 4: -1.1500974 -0.3181102 -0.07910133
# 5: -0.6549245  1.8385510  0.73741980
# 6:  0.8049360  0.4722533 -0.65750679

This reads in just columns 3, 8, and 20 and is extremely fast.
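To address the update in the question: `fread`'s `select` argument also accepts a character vector of column names, so you don't need to know the indices in advance. With the example file above (whose columns are named X1 through X100):

```r
library(data.table)

# select columns by name instead of by position
dt <- fread("sample.csv", select = c("X3", "X8", "X20"))
```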

jlhoward

Can you pre-process using awk before loading into R? If so, say you want columns 2, 3 and 5, you can do:

awk -F, -v OFS=, '{print $2,$3,$5}' yourfile.csv > cols23and5.csv

Note the -F, which tells awk to split fields on commas (by default it splits on whitespace), and OFS=, which keeps the output comma-separated.

If your CSV file is quoted (e.g. because some fields contain commas), and looks like this:

"Field 1","Field 2, with commas, in it","Field 3","Field 4, also with commas,,,"
"Field 1","Field 2, with commas, in it","Field 3","Field 4, also with commas,,,"

you can remove the double quotes and change the field separator from commas into, say, colons, like this:

sed -e 's/","/:/g' -e 's/"//g' yourfile.csv > ColonSeparated.csv

so that your file becomes:

Field 1:Field 2, with commas, in it:Field 3:Field 4, also with commas,,,
Field 1:Field 2, with commas, in it:Field 3:Field 4, also with commas,,,

then you can process it with awk using the colon as a separator and without the embedded commas worrying you:

awk -F: -v OFS=: '{print $2,$3,$4}' ColonSeparated.csv > SmallFileForR.csv

(Setting OFS=: keeps the output colon-separated; printing FS between fields would insert stray spaces around each colon.)
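Putting the two steps together, a runnable sketch (file names and sample data are illustrative):

```shell
# Build a sample quoted CSV with embedded commas
printf '%s\n' '"a1","b2, with commas","c3","d4"' \
              '"a1","b2, with commas","c3","d4"' > yourfile.csv

# Step 1: strip the quotes and switch the separator to colons
sed -e 's/","/:/g' -e 's/"//g' yourfile.csv > ColonSeparated.csv

# Step 2: keep only fields 2-4, colon-separated, ready for R
awk -F: -v OFS=: '{print $2,$3,$4}' ColonSeparated.csv > SmallFileForR.csv

cat SmallFileForR.csv
# b2, with commas:c3:d4
# b2, with commas:c3:d4
```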
Mark Setchell
  • I have quoted strings in my file, therefore awk screws up the job :( – P.Escondido Mar 20 '14 at 20:05
  • I have updated my answer to remove double and single quotes - please have another look. – Mark Setchell Mar 20 '14 at 21:00
  • sorry, I meant that one of the columns is the string of the form "hello, world, hello". The separator is "," but it can also be a part of that string. Perhaps one can change the separator to something that never shows up - e.g. #$? – P.Escondido Mar 20 '14 at 21:12
  • I don't know if you can edit your question now it has been marked as a duplicate. If you can, it would help a lot if you could show 3-5 lines of your data and say which fields you want. If you can't, you are welcome to click on my profile and find my Email address there and send it to me - although it is just about bedtime in the UK, so I may only answer tomorrow. – Mark Setchell Mar 20 '14 at 21:17
  • I have updated my answer to show how you can handle double-quoted fields in your CSV file. – Mark Setchell Mar 21 '14 at 10:21

On Linux, give awk a chance. It processes files record by record.

You can check Using AWK on CSV Files for some strategies.
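For instance, a minimal sketch (sample data is illustrative) pulling columns 1 and 3 from a simple, unquoted CSV:

```shell
# Create a tiny sample CSV
printf '%s\n' 'a,b,c' 'd,e,f' > input.csv

# Keep columns 1 and 3, comma-separated
awk -F, -v OFS=, '{print $1,$3}' input.csv
# a,c
# d,f
```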

polras