I have been seeing unexpected results when selecting rows of a data frame using a vector of required row names. I've realised this is because R allows partial matching between data frame row names and the strings in my vector. Following the question here .. R returning partial matching of row names .. this seems to happen when a string contains characters followed by numerics? The answers to that question address a single-row criterion but do not explain how to deal with row searches using a vector.
For example, if I have a data frame (df):
df<-data.frame(matrix(c(0.5,0.4,0.6,rep(0,3)), ncol=2, nrow=3))
colnames(df)<-c("pdx","primary")
rownames(df)<-c("chr6_LINC00680-GUSBP4","chr6_MIR5689HG","chr1_SPRR2")
> df
pdx primary
chr6_LINC00680-GUSBP4 0.5 0
chr6_MIR5689HG 0.4 0
chr1_SPRR2 0.6 0
And a search vector (test_vector):
test_vector<-c("chr6_MIR5689","chr6_LINC00680","chr1_SPRR2")
> test_vector
[1] "chr6_MIR5689" "chr6_LINC00680" "chr1_SPRR2"
If I search for the values of column "pdx" matching the rows in the search vector I get:
> df[test_vector,"pdx"]
[1] 0.4 0.5 0.6
Or by all columns, I get:
> df[test_vector,]
pdx primary
chr6_MIR5689HG 0.4 0
chr6_LINC00680-GUSBP4 0.5 0
chr1_SPRR2 0.6 0
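If an exact match is present in the row names, the exact match is returned instead of the partial one (although "chr1_SPRR2", which has no exact match here, still partially matches "chr1_SPRR2C"):
df2<-data.frame(matrix(c(0.6,10,20,0.5,0.4,rep(0,5)), ncol=2, nrow=5))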
colnames(df2)<-c("pdx","primary")
rownames(df2)<-c("chr1_SPRR2C","chr6_LINC00680","chr6_MIR5689","chr6_LINC00680-GUSBP4","chr6_MIR5689HG")
> df2
pdx primary
chr1_SPRR2C 0.6 0
chr6_LINC00680 10.0 0
chr6_MIR5689 20.0 0
chr6_LINC00680-GUSBP4 0.5 0
chr6_MIR5689HG 0.4 0
> df2[test_vector,]
pdx primary
chr6_MIR5689 20.0 0
chr6_LINC00680 10.0 0
chr1_SPRR2C 0.6 0
I am extracting values from a data frame using a df[row-vector,column] match, where not all row names I'm searching for are present in the data frame. I need to retain this information as an NA, with matches/NAs in the same order as the initial search vector.
So ideally I would get:
> df[test_vector,"pdx"]
[1] NA NA 0.6
How can I get around this partial pattern recognition, while retaining the output in the same order as the search vector, with a search vector of ~ 10,000 elements, avoiding loops and with any elements in the vector not present in rownames(df) replaced by NA?
(Being run with version.string R version 3.6.1 (2019-07-05))
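For illustration, here is a minimal sketch of one way the desired output might be obtained. It assumes exact, whole-string matching is what is wanted, and uses base R's `match()` to turn the search vector into integer row positions before indexing. The names `idx` and `pdx_exact` are made up for this example and are not part of the original code.

```r
# Rebuild the example data frame and search vector from the question.
df <- data.frame(matrix(c(0.5, 0.4, 0.6, rep(0, 3)), ncol = 2, nrow = 3))
colnames(df) <- c("pdx", "primary")
rownames(df) <- c("chr6_LINC00680-GUSBP4", "chr6_MIR5689HG", "chr1_SPRR2")
test_vector <- c("chr6_MIR5689", "chr6_LINC00680", "chr1_SPRR2")

# match() compares whole strings only and returns NA where no exact match exists.
# The result has one entry per element of test_vector, in the same order.
idx <- match(test_vector, rownames(df))

# Indexing rows by integer position avoids the partial matching that
# character row indices trigger; NA positions give NA values.
pdx_exact <- df[idx, "pdx"]

# Optional: label the result with the search terms (hypothetical convenience step).
names(pdx_exact) <- test_vector
```

Because `match()` is vectorised, this should need no explicit loop for a search vector of around 10,000 elements. The same `idx` could also be used as `df[idx, ]` to pull all columns, in which case the unmatched rows come back filled with NA.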