2

I'd like to use a lookup table to add a column with additional data using tidyr/dplyr functions. I've found some basic examples that perform this operation but can't get it to work with my data and I'm not really understanding the functions as the responses don't explain what is happening.

I want to match the id column in this data frame:

>df
            id sample_name       fpkm    conf_hi    conf_lo quant_status
1  XLOC_000118          T1  33.857900  62.323300  5.3925000           OK
2  XLOC_000118          T2 169.793000 395.783000  0.0000000           OK
3  XLOC_000118          T3  41.869200  69.395700 14.3427000           OK
4  XLOC_009095          T1   1.472500   3.076350  0.0000000           OK
5  XLOC_009095          T2   3.828400   8.171850  0.0000000           OK
6  XLOC_009095          T3   1.806010   4.055220  0.0000000           OK

...to the same values in this lookup table and add the name value to df in a new column where lookupTable$name matches df$id:

>lookupTable
           id name
1 XLOC_000118 Xy13
2 XLOC_009104  Xy3
3 XLOC_009105  Zy3
4 XLOC_009095  Xy6
5 XLOC_018501  Xy9
6 XLOC_020049 Xy35

I attempted to adapt the code from this question but receive errors:

df %>%
  gather(key = "col") %>% 
  left_join(ObLookup, by = "id") %>%
  spread(key = id, value = name)

Error: `by` can't contain join column `id` which is missing from LHS
In addition: Warning message:
attributes are not identical across measure variables;
they will be dropped 

I figured the following solution out on my own which produces the result that I want, but I would like to know if there is a solution using tidyr or dplyr:

> df$names <- lookupTable$name[match(df$id, lookupTable$id)]
> df
            id sample_name       fpkm    conf_hi    conf_lo quant_status names
1  XLOC_000118          T1  33.857900  62.323300  5.3925000           OK Obp13
2  XLOC_000118          T2 169.793000 395.783000  0.0000000           OK Obp13
3  XLOC_000118          T3  41.869200  69.395700 14.3427000           OK Obp13
4  XLOC_009095          T1   1.472500   3.076350  0.0000000           OK  Obp6
5  XLOC_009095          T2   3.828400   8.171850  0.0000000           OK  Obp6
6  XLOC_009095          T3   1.806010   4.055220  0.0000000           OK  Obp6
user974887
  • 2,309
  • 3
  • 17
  • 18
  • Which is the `ObComplete` dataset – akrun Jul 24 '18 at 17:29
  • 1
    df, edited the code – user974887 Jul 24 '18 at 17:30
  • You need `df %>% gather(key, val, -id) %>% left_join(lookupTable) %>% group_by(id) %>% mutate(rn = row_number()) %>% spread(id, name)` – akrun Jul 24 '18 at 17:31
  • Since you're becoming an advanced Stack Overflow asker by now, consider to provide your example data rather by posting the output of `dput(df)` than copy-paste. This makes us a lot easier to help you. For help on this topic, you may want to read: https://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example/5963610#5963610 Cheers! – jay.sf Jul 24 '18 at 17:34
  • Akrun, that code adds the names, but spreads the id values as columns and introduces a bunch of NAs where there isn't corresponding values. I'd like the original structure of "df" with "name" included as a column where the "id" matches. – user974887 Jul 24 '18 at 17:37
  • Please check the solution updated. It should match the expected – akrun Jul 24 '18 at 18:12

1 Answers1

1

We may need to select the columns that of interest in the gather, and then after the left_join do the spread

library(tidyverse)
df %>%
  gather(key, val, fpkm:conf_lo) %>%
  left_join(lookupTable) %>% 
  spread(key, val)
#         id sample_name quant_status name   conf_hi conf_lo      fpkm
#1 XLOC_000118          T1           OK Xy13  62.32330  5.3925  33.85790
#2 XLOC_000118          T2           OK Xy13 395.78300  0.0000 169.79300
#3 XLOC_000118          T3           OK Xy13  69.39570 14.3427  41.86920
#4 XLOC_009095          T1           OK  Xy6   3.07635  0.0000   1.47250
#5 XLOC_009095          T2           OK  Xy6   8.17185  0.0000   3.82840
#6 XLOC_009095          T3           OK  Xy6   4.05522  0.0000   1.80601
akrun
  • 874,273
  • 37
  • 540
  • 662
  • This code produces the same table full of NAs, not sure what you mean in your description. I edited the question to show a solution with the table I'm looking for. – user974887 Jul 24 '18 at 17:50
  • @user974887 Ok, I changed the code. I think you need to `spread` by different column – akrun Jul 24 '18 at 17:56