I have been seeing unexpected results when selecting rows of a data frame using a vector of required row names. I've realised this is because R allows partial matching between data frame row names and the strings in my vector. Following the question "R returning partial matching of row names", this seems to happen when a string contains characters followed by numerics? The answers to that question address a single row criterion but do not explain how to deal with row searches using a vector.

For example, if I have a data frame (df):

df<-data.frame(matrix(c(0.5,0.4,0.6,rep(0,3)), ncol=2, nrow=3))
colnames(df)<-c("pdx","primary")
rownames(df)<-c("chr6_LINC00680-GUSBP4","chr6_MIR5689HG","chr1_SPRR2")

> df
                          pdx primary
chr6_LINC00680-GUSBP4     0.5       0
chr6_MIR5689HG            0.4       0
chr1_SPRR2                0.6       0

And a search vector (test_vector):

test_vector<-c("chr6_MIR5689","chr6_LINC00680","chr1_SPRR2")

> test_vector
[1] "chr6_MIR5689"   "chr6_LINC00680" "chr1_SPRR2" 

If I search for the values of column "pdx" matching the rows in the search vector I get:

> df[test_vector,"pdx"]
[1] 0.4 0.5 0.6

Or by all columns, I get:

> df[test_vector,]
                          pdx primary
chr6_MIR5689HG            0.4       0
chr6_LINC00680-GUSBP4     0.5       0
chr1_SPRR2                0.6       0

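If an exact match is present in the row names, the exact match is returned instead (although "chr1_SPRR2", which has no exact match here, still partially matches "chr1_SPRR2C"):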

df2<-data.frame(matrix(c(0.6,10,20,0.5,0.4,rep(0,5)), ncol=2, nrow=5))
colnames(df2)<-c("pdx","primary")
rownames(df2)<-c("chr1_SPRR2C","chr6_LINC00680","chr6_MIR5689","chr6_LINC00680-GUSBP4","chr6_MIR5689HG")

> df2
                       pdx primary
chr1_SPRR2C            0.6       0
chr6_LINC00680        10.0       0
chr6_MIR5689          20.0       0
chr6_LINC00680-GUSBP4  0.5       0
chr6_MIR5689HG         0.4       0

> df2[test_vector,]
                pdx primary
chr6_MIR5689   20.0       0
chr6_LINC00680 10.0       0
chr1_SPRR2C     0.6       0

I am extracting values from a data frame using a df[row-vector,column] match, where not all row names I'm searching for are present in the data frame. I need to retain this information as an NA, with matches/NAs in the same order as the initial search vector.

So ideally I would get:

> df[test_vector,"pdx"]
[1] NA NA 0.6

How can I get around this partial pattern recognition, while retaining the output in the same order as the search vector, with a search vector of ~ 10,000 elements, avoiding loops and with any elements in the vector not present in rownames(df) replaced by NA?
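One possible approach, sketched here on the assumption that exact string equality is what is wanted: base R's match() compares whole strings only, returns positions in the order of the search vector, and gives NA where there is no match. The names idx and pdx_exact are only illustrative.

```r
# Rebuild the example data frame from above
df <- data.frame(matrix(c(0.5, 0.4, 0.6, rep(0, 3)), ncol = 2, nrow = 3))
colnames(df) <- c("pdx", "primary")
rownames(df) <- c("chr6_LINC00680-GUSBP4", "chr6_MIR5689HG", "chr1_SPRR2")

test_vector <- c("chr6_MIR5689", "chr6_LINC00680", "chr1_SPRR2")

# match() does exact matching only; unmatched names become NA
idx <- match(test_vector, rownames(df))

# Index the column with the integer positions; NA positions yield NA values
pdx_exact <- df$pdx[idx]
print(pdx_exact)
# [1]  NA  NA 0.6
```

Because match() is vectorised, this needs no explicit loop and should scale to a search vector of ~10,000 elements.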

(Run with R version 3.6.1 (2019-07-05).)

user3589420
  • FYI The community has been moving away from row.names, and instead keeping the equivalent information as a proper column of the data.frame. Example of some of the relevant points: https://adv-r.hadley.nz/vectors-chap.html#rownames – s_baldur Mar 06 '20 at 16:08
  • Thank you @sindri_baldur. Out of interest how would I return a vector of NA, NA, 0.6 from df if the row names were stored as an additional column df$geneID ? – user3589420 Mar 06 '20 at 16:40
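For the column-based layout asked about in the comment above, a minimal sketch, assuming the identifiers are stored in a character column named geneID as suggested there (df_col is a hypothetical name for that version of the data frame):

```r
# Same data as df, but with the identifiers in an ordinary column
df_col <- data.frame(
  geneID  = c("chr6_LINC00680-GUSBP4", "chr6_MIR5689HG", "chr1_SPRR2"),
  pdx     = c(0.5, 0.4, 0.6),
  primary = c(0, 0, 0),
  stringsAsFactors = FALSE  # keep geneID as character (default is TRUE in R 3.6)
)

test_vector <- c("chr6_MIR5689", "chr6_LINC00680", "chr1_SPRR2")

# Exact lookup against the column, in search-vector order, NA when absent
pdx_from_col <- df_col$pdx[match(test_vector, df_col$geneID)]
print(pdx_from_col)
# [1]  NA  NA 0.6
```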

0 Answers