0

I have a long (~700,000 rows) CSV with 2 columns. One column with locations, which are written as cg and 8 digits following (ex. cg12345678) and the corresponding column has a r value (just a digit between -1 and 1). The other CSV is much smaller (~20 rows) and only contains one column with some cg locations. I want to print only the r values of the big excel that correspond to the cg locations of the small excel.

Here is an abbreviated example of the 2 column CSV:

cg07881041  -0.0192398465425986
*cg03513874 -0.339360471677652
cg25458538  0.0451334622844003
*cg09261072 0.208770797055665
cg02404579  -0.0166889943192668
cg22585117  -0.340873841270817
*cg25552317 -0.0372823043801581

Here is an example of the one column CSV:

cg08829765
*cg25552317
*cg09261072
cg14370485
*cg03513874
cg10855276
cg12406992

I added stars to the 3 locations that match in this example. So I would want to print the following matrix:

Matching cg  corresponding rvalue
cg03513874  -0.339360471677652
cg09261072  0.208770797055665
cg25552317  -0.0372823043801581
hawkeye03
  • 27
  • 5

1 Answers1

0

Here's a dplyr approach:

library(dplyr)
df_1col %>%
  left_join(df_2col) %>%
  filter(!is.na(p_value))

#Joining, by = "cg"
#          cg    p_value
#1 cg25552317 -0.0372823
#2 cg09261072  0.2087708
#3 cg03513874 -0.3393605

source data:

df_2col <- read.table(
  header = T,
  stringsAsFactors = F,
  text = "cg  p_value
  cg07881041  -0.0192398465425986
cg03513874 -0.339360471677652
cg25458538  0.0451334622844003
cg09261072 0.208770797055665
cg02404579  -0.0166889943192668
cg22585117  -0.340873841270817
cg25552317 -0.0372823043801581")

df_1col <- data.frame(cg = c("cg08829765","cg25552317",
                         "cg09261072","cg14370485",
                         "cg03513874","cg10855276",
                         "cg12406992"), stringsAsFactors = F)
Jon Spring
  • 55,165
  • 4
  • 35
  • 53