10

I have a dataframe in which I want to use certain values as hash keys / dictionary keys (or whatever you call them in your language of choice) for other values in that dataframe. Say I have a dataframe like this, which I've read in from a large CSV file (only the first row is shown):

  Plate.name QN.number Well Allele.X.Rn Allele.Y.Rn Call
1 Plate 1_A1    QN2200   A1       1.766       2.791 Both

which in R code would be:

 structure(list(Plate.name = structure(1L, .Label = "Plate 1_A1", class = "factor"), 
    QN.number = structure(1L, .Label = "QN2200", class = "factor"), 
    Well = structure(1L, .Label = "A1", class = "factor"), Allele.X.Rn = 1.766, 
    Allele.Y.Rn = 2.791, Call = structure(1L, .Label = "Both", class = "factor")), .Names = c("Plate.name", 
"QN.number", "Well", "Allele.X.Rn", "Allele.Y.Rn", "Call"), class = "data.frame", row.names = c(NA, 
-1L))

The QN.numbers are unique IDs in my dataset. How do I retrieve data using the QN.number as a reference for the other values? That is to say, I want to know the Call or the Allele.X.Rn for a given QN.number. It seems row.names might do the trick, but how would I use them in this instance?

arandomlypickedname
  • +1 for a reproducible example and well-asked question. – Ari B. Friedman Jul 25 '11 at 10:12
  • Row names in a data frame must be unique (as in a hash or dictionary), so you might want to check that before you use QN.number in row.names(). Something like `sum(tapply(d$a,d$a,length)>1)` will tell you how many values in column a of data frame d occur more than once. –  Jul 25 '11 at 16:40
  • Ah yes thanks Seth, I should have mentioned that the QN.number is a unique ID. I'll edit the question – arandomlypickedname Aug 02 '11 at 04:15
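
The uniqueness check suggested in the comments can also be sketched with base R's `anyDuplicated()` and `duplicated()`. The data frame below is hypothetical (it deliberately repeats one ID and is not taken from the question's data), so treat it as an illustration only:

```r
# Hypothetical data with one repeated ID, for illustration only.
d <- data.frame(QN.number = c("QN2200", "QN2201", "QN2200"),
                Call = c("Both", "X", "Y"),
                stringsAsFactors = FALSE)

# anyDuplicated() returns 0 when every value is unique,
# otherwise the index of the first repeated value.
first_dup <- anyDuplicated(d$QN.number)

# duplicated() flags each repeat after the first occurrence,
# so this lists the IDs that would break row.names().
dup_ids <- unique(d$QN.number[duplicated(d$QN.number)])

print(first_dup)
print(dup_ids)
```

If `first_dup` is 0, the column should be safe to use as row names.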

3 Answers

5

You can use row.names like this:

> row.names(d)=d$QN.number
> d["QN2200",]
       Plate.name QN.number Well Allele.X.Rn Allele.Y.Rn Call
QN2200 Plate 1_A1    QN2200   A1       1.766       2.791 Both
> d["QN2201",]
   Plate.name QN.number Well Allele.X.Rn Allele.Y.Rn Call
NA       <NA>      <NA> <NA>          NA          NA <NA>

You just pass the row name as the first index when subsetting. You can also pass several row names at once:

> d=data.frame(a=letters[1:10],b=runif(10))
> row.names(d)=d$a
> d[c("a","g","d"),]
  a         b
a a 0.6434431
g g 0.6724661
d d 0.9826392

Now I'm not sure how clever this is, and whether it does sequential search for each row name or faster indexing...

Spacedman
  • This method may give unexpected results due to row.names partial matching. More info here => https://stackoverflow.com/questions/34233235/r-returning-partial-matching-of-row-names – 3r1d May 17 '23 at 18:03
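
One way to sidestep the partial-matching concern raised in the comment is to look the ID up with `match()`, which only accepts exact matches, and then index by position. This is a minimal sketch, not the answer author's method; `lookup_row` is a hypothetical helper, and the second row of data is invented for illustration:

```r
# Second row is made up; only QN2200 appears in the question.
d <- data.frame(QN.number = c("QN2200", "QN2201"),
                Allele.X.Rn = c(1.766, 1.532),
                Call = c("Both", "X"),
                stringsAsFactors = FALSE)

# Hypothetical helper: exact-match lookup of a single row by ID.
lookup_row <- function(df, id) {
  i <- match(id, df$QN.number)   # NA when the ID is absent; no partial matching
  if (is.na(i)) return(NULL)     # signal "not found" explicitly
  df[i, , drop = FALSE]
}

hit  <- lookup_row(d, "QN2200")
miss <- lookup_row(d, "QN22")    # a prefix of real IDs, but not an ID itself

print(hit)
print(is.null(miss))
```

Returning `NULL` for a missing key is a design choice made here for clarity; the row.names approach returns a row of `NA`s instead.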
4

Use subset.

 subset(your_data, QN.number == "QN2200", Allele.X.Rn)

with provides an alternative; here the output is a vector rather than another data frame.

with(your_data, Allele.X.Rn[QN.number == "QN2200"])
Richie Cotton
  • I can get subset to work on the limited test dataset I've provided but I can't get it for the life of me to work on the real data set: the dreaded "undefined columns selected" error – arandomlypickedname Aug 02 '11 at 08:55
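
For reference, here are the two calls from this answer run against a one-row data frame modelled on the question's example, showing the difference in return type. The data frame is rebuilt by hand here, so it is an assumption about the shape of the real data rather than a copy of it:

```r
# One-row stand-in for the question's data (character ID, not a factor).
your_data <- data.frame(QN.number = "QN2200",
                        Allele.X.Rn = 1.766,
                        Call = "Both",
                        stringsAsFactors = FALSE)

# subset() keeps the result as a data frame with the selected column.
as_df <- subset(your_data, QN.number == "QN2200", select = Allele.X.Rn)

# with() evaluates the expression inside the data frame and returns a vector.
as_vec <- with(your_data, Allele.X.Rn[QN.number == "QN2200"])

print(class(as_df))
print(class(as_vec))

# Lists the exact column names available to subset() and with().
print(names(your_data))
```

When the real data behaves differently from the test data, it may be worth comparing the output of `names(your_data)` with the column names used in the call.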
1

Assuming the data frame is stored in a variable (I'll call it dataframe for now), the following should do it:

dataframe$Allele.X.Rn[which(dataframe$QN.number == "<whatever>")]

where <whatever> is the ID you'd like to look up in QN.number (quoted, since the IDs are strings rather than numbers).
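
A short sketch of what `which()` contributes in this lookup: when the ID column contains `NA`, plain logical indexing carries the `NA` through to the result, while `which()` keeps only the positions that are `TRUE`. The data below is hypothetical (the `NA` and the extra rows are invented for illustration):

```r
# Hypothetical data with a missing ID in the second row.
dataframe <- data.frame(QN.number = c("QN2200", NA, "QN2201"),
                        Allele.X.Rn = c(1.766, 2.100, 1.532),
                        stringsAsFactors = FALSE)

# The comparison yields TRUE, NA, FALSE.
is_match <- dataframe$QN.number == "QN2200"

# which() drops the NA and returns only the matching position.
with_which <- dataframe$Allele.X.Rn[which(is_match)]

# Plain logical indexing returns an extra NA for the missing ID.
without_which <- dataframe$Allele.X.Rn[is_match]

print(with_which)
print(without_which)
```

If the ID column is known to have no missing values, the two forms should give the same result.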