How do I use elements of a dataframe like hash keys / dictionary keys / primary keys?

Question

I have a dataframe in which I want to use certain values as hash keys / dictionary keys (or whatever you call it in your language of choice) for other values in that dataframe. Say I have a dataframe like this which I've read in from a large csv file (only first row shown):

  Plate.name QN.number Well Allele.X.Rn Allele.Y.Rn Call
1 Plate 1_A1    QN2200   A1       1.766       2.791 Both

which in R code would be:

 structure(list(Plate.name = structure(1L, .Label = "Plate 1_A1", class = "factor"), 
    QN.number = structure(1L, .Label = "QN2200", class = "factor"), 
    Well = structure(1L, .Label = "A1", class = "factor"), Allele.X.Rn = 1.766, 
    Allele.Y.Rn = 2.791, Call = structure(1L, .Label = "Both", class = "factor")), .Names = c("Plate.name", 
"QN.number", "Well", "Allele.X.Rn", "Allele.Y.Rn", "Call"), class = "data.frame", row.names = c(NA, 
-1L))

The QN.numbers are unique IDs in my dataset. How do I retrieve data using the QN.number as a reference for the other values? That is, I want to know the Call or the Allele.X.Rn for a given QN.number. It seems row.names might do the trick, but how would I use them in this instance?

Row names in a data frame must be unique (as in a hash or dictionary), so you might want to check that before you use QN.number in row.names(). Something like `sum(tapply(d$a,d$a,length)>1)` will tell you how many values occur more than once in column a of data frame d. — , Jul 25 '11 at 16:40
Ah yes thanks Seth, I should have mentioned that the QN.number is a unique ID. I'll edit the question — arandomlypickedname, Aug 02 '11 at 04:15
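
A minimal sketch of that uniqueness check on made-up data. The data frame `d` and column `a` follow the comment above, but the values are invented for illustration. `anyDuplicated` and `duplicated` are base R alternatives to the `tapply` idiom:

```r
# Invented toy data: "y" and "z" each occur more than once
d <- data.frame(a = c("x", "y", "y", "z", "z", "z"), stringsAsFactors = FALSE)

# The tapply idiom from the comment: number of distinct values that repeat
n_repeated <- sum(tapply(d$a, d$a, length) > 1)

# Base R shortcut: index of the first duplicated entry, or 0 if all are unique
first_dup <- anyDuplicated(d$a)

# The repeated values themselves
dup_values <- unique(d$a[duplicated(d$a)])

# The column is only usable as row names when nothing is duplicated
safe_as_key <- first_dup == 0
```

If `safe_as_key` is `FALSE`, assigning the column to `row.names()` will fail with a duplicate row names error, so it is worth running the check first on a large file.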

score 5 · Accepted Answer · answered Jul 25 '11 at 11:30

You can use row.names like this:

> row.names(d)=d$QN.number
> d["QN2200",]
       Plate.name QN.number Well Allele.X.Rn Allele.Y.Rn Call
QN2200 Plate 1_A1    QN2200   A1       1.766       2.791 Both
> d["QN2201",]
   Plate.name QN.number Well Allele.X.Rn Allele.Y.Rn Call
NA       <NA>      <NA> <NA>          NA          NA <NA>

You just use the row name as the first parameter in the subsetting. You can also use multiple row names:

> d=data.frame(a=letters[1:10],b=runif(10))
> row.names(d)=d$a
> d[c("a","g","d"),]
  a         b
a a 0.6434431
g g 0.6724661
d d 0.9826392

Now I'm not sure how clever this is, and whether it does sequential search for each row name or faster indexing...
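
If lookup speed is a concern, one alternative (not part of the answer above, offered only as a sketch) is to build an explicit key-to-value structure once and query that. The example below uses invented data with the question's column names; the second row, `call_by_qn`, and `lookup` are hypothetical names. It shows a named vector and an environment, which base R can create as a hash table via `new.env(hash = TRUE)`:

```r
# Invented two-row data frame using the question's column names
d <- data.frame(QN.number = c("QN2200", "QN2201"),
                Call = c("Both", "Allele X"),
                stringsAsFactors = FALSE)

# Option 1: a named vector, where the names act as keys for one column
call_by_qn <- setNames(d$Call, d$QN.number)
call_for_2200 <- call_by_qn[["QN2200"]]

# Option 2: an environment holding one whole row per key
lookup <- new.env(hash = TRUE)
for (i in seq_len(nrow(d))) {
  assign(d$QN.number[i], d[i, ], envir = lookup)
}
call_for_2201 <- get("QN2201", envir = lookup)$Call

# Test for a missing key before calling get(), which errors on unknown names
has_9999 <- exists("QN9999", envir = lookup, inherits = FALSE)
```

Whether either option is actually faster than row-name indexing on your data is something to measure rather than assume.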

This method may give unexpected results due to row.names partial matching. More info here => https://stackoverflow.com/questions/34233235/r-returning-partial-matching-of-row-names — 3r1d, May 17 '23 at 18:03
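
A small sketch of the partial-matching issue mentioned in the comment above, using invented data (the second row and the key "QN22" are made up). Character row indexing on a data frame partially matches row names, whereas `match()` compares whole strings only:

```r
# Invented data; "QN22" is NOT a real ID, only a prefix of "QN2200"
d <- data.frame(QN.number = c("QN2200", "QN3301"),
                Call = c("Both", "Allele X"),
                stringsAsFactors = FALSE)
row.names(d) <- d$QN.number

# Row-name indexing: the unique prefix "QN22" silently picks up row "QN2200"
partial <- d["QN22", ]

# match() requires the full key, so the prefix is not found (NA)
prefix_idx <- match("QN22", d$QN.number)

# The full key is found as expected
exact_idx <- match("QN2200", d$QN.number)
exact_call <- d[exact_idx, "Call"]
```

If exact key semantics matter, looking up the row number with `match()` and indexing by that number avoids the surprise.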

score 4 · Answer 2 · answered Jul 25 '11 at 10:08


Use subset.

 subset(your_data, QN.number == "QN2200", Allele.X.Rn)

`with` provides an alternative; here the output is a vector rather than another data frame.

with(your_data, Allele.X.Rn[QN.number == "QN2200"])

Richie Cotton


I can get subset to work on the limited test dataset I've provided but I can't get it for the life of me to work on the real data set: the dreaded "undefined columns selected" error – arandomlypickedname Aug 02 '11 at 08:55

score 1 · Answer 3 · answered Jul 25 '11 at 10:08


Assuming that we're storing our data frame in a variable name--I'll call it dataframe for now--the following should do it:

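dataframe$Allele.X.Rn[which(dataframe$QN.number == <whatever>)]

Where, of course, <whatever> is the value that you'd like to match against QN.number (for example "QN2200"). Note that R is case-sensitive, so the column must be spelled QN.number exactly as in the data frame.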



You don't need the call to `which`; logical indexing is fine. – Richie Cotton Jul 25 '11 at 10:09

Another option would be: `dataframe[dataframe$QN.number == "QN2200", "Allele.Y.Rn"]`. – Roman Luštrik Jul 25 '11 at 11:40
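
A short sketch comparing the two indexing styles discussed in these comments, on invented data. With complete keys they give the same result. The one case where they differ is a missing (`NA`) key, which is an assumption added here for illustration and does not apply if every QN.number is present:

```r
# Invented data with one missing key to show where the two styles differ
dataframe <- data.frame(QN.number = c("QN2200", NA, "QN2201"),
                        Allele.X.Rn = c(1.766, 2.100, 3.000),
                        stringsAsFactors = FALSE)

# Logical indexing: the NA comparison propagates and yields an NA element
by_logical <- dataframe$Allele.X.Rn[dataframe$QN.number == "QN2200"]

# which() keeps only the positions that are TRUE, so the NA is dropped
by_which <- dataframe$Allele.X.Rn[which(dataframe$QN.number == "QN2200")]

# Row/column form from the comment above, made NA-safe with which()
by_bracket <- dataframe[which(dataframe$QN.number == "QN2200"), "Allele.X.Rn"]
```

So `which` is unnecessary when the key column has no missing values, and a cheap safeguard when it might.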
