Let this be my data:

my.data<-data.frame(name=c("a","b","b","c","c","c"))

What I need is a variable that indicates, for each name, its relative frequency in the dataset. Essentially, it would look like this:

  name    target
1    a 0.1666667
2    b 0.3333333
3    b 0.3333333
4    c 0.5000000
5    c 0.5000000
6    c 0.5000000

I tried computing a dummy variable for each name and then, based on these dummies, calculating new variables that hold the relative frequency of each name in the dataset. See below:

temp_dummies<-data.frame(spatstat::dummify(my.data$name))
my.data<-cbind.data.frame(my.data, temp_dummies)
rm(temp_dummies)

my.data %>%
  dplyr::mutate(a_per=mean(a),
                b_per=mean(b),
                c_per=mean(c)) -> my.data

Now I need to extract the relative frequency for each name and map it back onto the rows to get my target variable. I guess I should do something like the code below, but I don't know what to mutate.

my.data %>%
  dplyr::group_by(name) %>%
  dplyr::mutate(...) -> my.data

Questions:

  1. How would I get my target variable using dplyr? Am I on the right track?
  2. Is there an easier way to achieve the same result?
  3. Might it be possible to write a function that does all of this stuff automatically? It seems like a pretty standard problem that we should be able to fix by simply applying a function(x) to name.
NelsonGon
Dr. Fabian Habersack
  • @NelsonGon Strange that a clear dupe target that you commented got reopened – akrun Jun 01 '19 at 15:37
  • Possible duplicate of https://stackoverflow.com/questions/27676128/calculate-relative-frequency-for-a-certain-group or https://stackoverflow.com/questions/24576515/relative-frequencies-proportions-with-dplyr – akrun Jun 01 '19 at 15:38
  • Thanks for the links. I'll leave that question to you. Personally, I'd say it gets quite close to what my problem was but it's not a duplicate. But that's just me. ;-) – Dr. Fabian Habersack Jun 01 '19 at 16:38
  • @akrun [this](https://stackoverflow.com/questions/27676128/calculate-relative-frequency-for-a-certain-group) strangely gives the same proportions as in this answer. – NelsonGon Jun 01 '19 at 17:09
  • @FabianHabersack `add_count` is simply a shortcut for `group-count-ungroup`. – NelsonGon Jun 01 '19 at 17:12
  • @FabianHabersack Sure, `add_count` is, as NelsonGon mentioned, a group-by frequency count. There are multiple dupes for this, though perhaps none that use `add_count`. However, your question is not about `add_count` per se – akrun Jun 01 '19 at 17:32
  • @FabianHabersack Also, this can be done in `base R`, similar to KoenV's answer: `prop.table(table(my.data))[as.character(my.data$name)]`, which returns `0.1666667 0.3333333 0.3333333 0.5000000 0.5000000 0.5000000` for the names `a b b c c c` – akrun Jun 01 '19 at 17:34
  • @NelsonGon To check the revisions of this post https://stackoverflow.com/posts/56407446/revisions – akrun Jun 01 '19 at 17:50
  • @akrun Frankly, I think it's best to just let it go. Otherwise, the closing and reopening may go on. Just that closing a duplicate is somehow opinion based. – NelsonGon Jun 01 '19 at 17:53
  • @NelsonGon @akrun I have closed this as a dupe of the link posted. Sorry, I did not realise it earlier. It can be solved by adapting the approach from the dupe link: `my.data %>% group_by(name) %>% mutate(count = n()) %>% ungroup() %>% mutate(count = count/n())` – Ronak Shah Jun 02 '19 at 03:09

2 Answers

With base-R, you could use the following one-liner:

my.data$target <- (table(my.data$name)/nrow(my.data))[ my.data$name ]

Explanation, with the same code split over several lines:

The table function counts the occurrences of each name; dividing by nrow(my.data) turns those counts into relative frequencies. Indexing the resulting table by my.data$name then looks up the frequency that belongs to the name in each row, and that value is stored in the corresponding row of the new column.

# relative frequency of each distinct name
rel_freq <- table(my.data$name)/nrow(my.data)
# look up each row's name in the frequency table
my.data$target <- rel_freq[ my.data$name ]
my.data

  name    target
1    a 0.1666667
2    b 0.3333333
3    b 0.3333333
4    c 0.5000000
5    c 0.5000000
6    c 0.5000000
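
Regarding question 3, the lookup above could be wrapped in a small helper. The following is only a sketch: the function name `rel_freq_of` is made up for illustration, and it assumes you want a plain numeric vector with one value per element of the input.

```r
# Hypothetical helper (name is an assumption, not from any package):
# returns, for every element of x, the relative frequency of its value in x.
rel_freq_of <- function(x) {
  x <- as.character(x)          # index the table by name, not by factor code
  freq <- prop.table(table(x))  # share of each distinct value
  as.numeric(freq[x])           # one plain numeric value per element
}

my.data <- data.frame(name = c("a", "b", "b", "c", "c", "c"))
my.data$target <- rel_freq_of(my.data$name)
my.data
```

Converting to character first keeps the lookup by name regardless of whether `name` is stored as a factor or as a character vector.
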
KoenV

We can use add_count to get the count of each name and then divide it by the total number of rows using n().

library(dplyr)

my.data %>%
   add_count(name) %>%
   mutate(n = n/n())

#  name      n
#  <fct> <dbl>
#1 a     0.167
#2 b     0.333
#3 b     0.333
#4 c     0.5  
#5 c     0.5  
#6 c     0.5  
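
For comparison, the group_by/mutate skeleton from the question can probably be completed along the same lines. This is a sketch rather than the only way to do it; the variable `total_rows` is introduced here for illustration, because inside a grouped mutate n() is the group size, not the size of the whole data frame.

```r
library(dplyr)

my.data <- data.frame(name = c("a", "b", "b", "c", "c", "c"))

# Total row count, taken before grouping (illustrative helper variable).
total_rows <- nrow(my.data)

my.data <- my.data %>%
  group_by(name) %>%
  mutate(target = n() / total_rows) %>%  # group size divided by total rows
  ungroup()                              # drop grouping so later steps see all rows

my.data
```
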
Ronak Shah