0

I have the following data:

#create file from which to sample from
x5 <- sample(1:100, 1000, replace=T)
#convert to data frame
x5 = as.data.frame(x5)

#create first file (take a random sample from the created file)
a = sample_n(x5, 900)
#create second file (take a random sample from the created file)
b = sample_n(x5, 900)

#combine
c = cbind(a,b)
#create dataframe
c = data.frame(c)
#rename column names
colnames(c) <- c("a","b")


    a  b
1   8 10
2  48 51
3 100 74
4  55  2
5  68 74
6   3 41

I also have a lookup file that contains an ID for each value:

l = length( unique(c((c$a),(c$b))))
lookup = data.frame( value = c(unique(c$a), unique(c$b)), id = as.integer(rnorm(l, 1000000,10000)))

 value      id
1    96 1017118
2    48  996872
3    51  974430
4    27  987184
5    52  985207
6    39 1009441

Is there a quick way that I can replace the entries in "c" with the values of "id" within the "lookup" table?

I would have try to do this with merges in R:

part_1 = data.frame(a = c$a)
part_2 = data.frame(b = c$b)

merge_1 =  merge(x = part_1 , y = lookup, by.x = "a", by.y = "value", all.x = TRUE)
merge_2 =  merge(x = part_2, y = lookup, by.x = "b", by.y = "value", all.x = TRUE)

new_file = data.frame(id_1 = merge_1$id, id_2 = merge_2$id)

But this did not work and just made everything equal to each other:

> head(new_file)
    id_1   id_2
1 987984 987984
2 998972 998972
3 987984 987984
4 998972 998972
5 987984 987984
6 998972 998972

Can someone please show me how to fix this problem?

Uwe Keim
  • 39,551
  • 56
  • 175
  • 291
stats_noob
  • 5,401
  • 4
  • 27
  • 83
  • 1
    Minor comment: `cbind` is a bad way to create data frames. `data.frame` is a much better way. Your 3 lines `c = cbind(a,b)` `c = data.frame(c)` `colnames(c) <- c("a","b")`. Can be condensed to `c = data.frame(a, b)`. The names will be `a` and `b` automatically, you don't first create a matrix potentially messing up column classes, it's shorter, more direct, and safer. – Gregor Thomas Aug 08 '22 at 16:51
  • 1
    to your lookup, `names(lookup$value) <- lookup$id`, perhaps, which doesn't address @GregorThomas comment below. – Chris Aug 08 '22 at 16:55
  • 1
    Also note that your lookup construction is not good: `c(unique(c$a), unique(c$b))` will have duplicate `value` entries with different `id`s if there are any values in both `a` and `b`. Not sure if that's a feature of your example only or a problem in your real data. – Gregor Thomas Aug 08 '22 at 16:56
  • 1
    But for lookups on a single value, named vectors or `match` are the common solutions. [This FAQ gives a nice example of the match method](https://stackoverflow.com/a/6112351/903061). – Gregor Thomas Aug 08 '22 at 16:57
  • @ Gregor Thomas: thank you for your suggestions! I will keep them in mind! – stats_noob Aug 08 '22 at 17:38

0 Answers0