
Good morning all

I am still newish to R, and I searched most forums for an answer to my problem (I suspect I am missing out on a crucial keyword somewhere), so apologies if I duplicate a question. My problem is similar to this question, but the answer does not quite work for me.

I have a matrix with 1.7m-odd rows, and at this point 20 columns. For the purposes of this exercise I only need to extract 20 rows from this matrix, but will need to do more than 1000 later on. I would like to be able to import a list of all the rows I would like to subset into a smaller matrix for further analysis, and I keep hitting my head against the wall.

I have created a smaller matrix with just 2 columns of interest, and set the row names to the animal IDs. The animal IDs are unique. Apologies for the clumsy coding.

EBV<-read.csv(file='bfile.csv', header=F, skip=1, sep=',', col.names=c("animal","anim_name","byear","anim_name_pa","anim_name_ma","sex","wwdir_ebv","wwdir_acc","wwmat_ebv","wwmat_acc","afc_ebv","afc_acc","icp_ebv","icp_acc","shd_ebv","shd_acc","scr_ebv","scr_acc","adg_ebv","adg_acc"))
head(EBV)
tail(EBV)
a<-subset(EBV, select=c(animal))
b<-subset(EBV, select=c(wwdir_ebv,wwdir_acc))
c<-as.numeric(as.character(unlist(a)))
d<-as.numeric(as.character(unlist(b)))
x<-matrix(d, nrow=1708891,ncol=2, byrow=F)
rownames(x)<-c
colnames(x)<-c("wwdir_ebv","wwdir_acc")
head(x)

Results of head(x):

*row.name* wwdir_ebv wwdir_acc
33525056   12.0321        49
33702721   13.6674        46
33791336    6.8078        63
33907452   11.0981        51
33909847    7.4192        67
34165696    8.5039        42

Now what I would like to do is something like this:

EX<-read.csv(file='braz.csv', header=F, sep=',', col.names=c("Ani"))
X<-as.numeric(as.list(unlist(EX)))
z<-subset(x, select=c('X'))

Where the "braz.csv" file only contains a single column, for argument's sake, with animals 33701721, 33791336 and 33909847. Extracting the animals one-by-one hasn't been too much of a problem, but typing 1000 names one-by-one eventually will be.

I don't know if it would be more effective to keep the animal IDs in a column of their own though (i.e., make a matrix of 1.7m x 3 instead of 1.7m x 2) and try to subset according to the column "animalID". My biggest concern is the list that I want to import and use for subsetting.

Thanks in advance!

  • I don't know why you go to all that trouble of creating matrices instead of using the data.frame returned by `read.csv`. Your use of `subset` also confuses me (because `select` selects columns, but apparently you want to subset by rows). Could you possibly just need `x[rownames(x) %in% unlist(EX),]`? – Roland Nov 12 '14 at 11:02
  • Hi @Roland. Thanks, that solved my problem halfway. When I checked your solution I found that while it gives me the subset I am looking for, it also sorts the row names numerically. I have another matrix that I will be using for calculations with this matrix, and I do need to keep the animal orders the same. Unless I also sort the genotypic matrix... Will go and check to see if it works. – Tundrahorse Nov 12 '14 at 11:24
  • My code should not change the row order. – Roland Nov 12 '14 at 11:55
  • Hi. Checked. The other matrix is a 76883 x 20, where the columns are the animal IDs. Transposing the matrix, sorting according to animal ID and retransposing it seems like a waste of effort. So I went back to the source file, sorted according to `aa<-a[order(a[,1],a[,2],decreasing=F),]`, constructed the 2nd matrix and Bob's your uncle. But for some strange reason I now found that when using `y<-x[rownames(x) %in% unlist(EX),]`, it leaves out row 15 and swaps rows 19 and 20 around, so I only have a 19 x 2 matrix left. The order of the file only changes when that line is run. Any ideas? – Tundrahorse Nov 12 '14 at 12:34
  • I can only point out the importance of a [minimal reproducible example](http://stackoverflow.com/a/5963610/1412059). – Roland Nov 12 '14 at 12:39
  • Hi Roland, found the error. It was in the original dataset - animals in row 15 and row 20 of the subset were too young to have data. Thanks for all your help! Now, how do I mark your first comment as the answer? – Tundrahorse Nov 12 '14 at 12:59
  • I've copied my comment into an answer. – Roland Nov 12 '14 at 13:07

1 Answer


I don't know why you go to all that trouble of creating matrices instead of using the data.frame returned by `read.csv`. Your use of `subset` also confuses me (because `select` selects columns, but apparently you want to subset by rows).
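As a minimal sketch of working on the data frame directly: the toy `EBV` and `EX` below are made-up stand-ins for what `read.csv` would return for `bfile.csv` and `braz.csv` (values taken from the `head(x)` output in the question), so treat the contents as assumptions rather than the real files.

```r
# Toy stand-in for the data frame that read.csv() returns (values from head(x))
EBV <- data.frame(
  animal    = c(33525056, 33702721, 33791336, 33907452, 33909847, 34165696),
  wwdir_ebv = c(12.0321, 13.6674, 6.8078, 11.0981, 7.4192, 8.5039),
  wwdir_acc = c(49, 46, 63, 51, 67, 42)
)

# Toy stand-in for braz.csv: a single column of animal IDs to keep
EX <- data.frame(Ani = c(33791336, 33909847))

# Keep the rows whose animal ID is in the list, and only the columns of interest
z <- EBV[EBV$animal %in% EX$Ani, c("animal", "wwdir_ebv", "wwdir_acc")]
print(z)
```

Keeping the IDs in an ordinary column like this avoids the round trip through `unlist`, `as.numeric` and `matrix`, and the same line should work unchanged for 20 or 1000 IDs.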

It appears you simply need `x[rownames(x) %in% unlist(EX),]`. Generally, you'll find that `[` is not less convenient than `subset` for subsetting, but more powerful. `subset` can also result in trouble when used inside functions. I'd advise you to study `help("[")`. `help("%in%")` might also be worth reading.
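A small sketch of how this behaves with respect to row order and IDs that are absent, which came up in the comments. The six-row matrix is rebuilt from the `head(x)` output in the question; the vector `ids` and its ordering are hypothetical, and the `match` variant is an alternative shown for comparison, not part of the line above.

```r
# Toy matrix built from the head(x) output in the question
x <- matrix(
  c(12.0321, 13.6674, 6.8078, 11.0981, 7.4192, 8.5039,
    49, 46, 63, 51, 67, 42),
  ncol = 2,
  dimnames = list(
    c("33525056", "33702721", "33791336", "33907452", "33909847", "34165696"),
    c("wwdir_ebv", "wwdir_acc")
  )
)

# Hypothetical IDs to extract; 33701721 is not among the six toy rows
ids <- c("33909847", "33701721", "33791336")

# %in% keeps rows in the order they have in x and silently skips absent IDs
by_in <- x[rownames(x) %in% ids, , drop = FALSE]

# match() follows the order of ids instead; absent IDs come back as NA
pos <- match(ids, rownames(x))
missing_ids <- ids[is.na(pos)]
by_match <- x[pos[!is.na(pos)], , drop = FALSE]

print(rownames(by_in))
print(rownames(by_match))
print(missing_ids)
```

Here `by_in` has its rows in the order of `x`, `by_match` has them in the order of `ids`, and `missing_ids` lists the requested animals that have no row at all, which is worth checking when the result has fewer rows than expected. `drop = FALSE` keeps the result a matrix even if only one row matches.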

Roland