I am new to R programming and am trying to use it for my data handling.

I am trying to create a new data frame by replacing some elements with the most frequently occurring element in my data frame.

My original data frame looks like this:

df:
id | first_name | last_name | info_1 | info_2
---|------------|-----------|--------|-------
 1 |  Hillary   |  Clinton  |    2   |  3
 1 |  Hillary   |  Clinton  |    10  |  2
 2 |  Donald    |  Trump    |    5   |  6
 2 |  Donald    |  Trump    |    3   |  8
 4 |  Hillary   |  Clinton  |    9   |  5
 3 |  Bernie    |  Sanders  |    5   |  0
 3 |  Donald    |  Trump    |    4   |  9
 3 |  Bernie    |  Sanders  |    24  |  9
 6 |  Bernie    |  Sanders  |    24  |  9

The new data frame should look like this:

new_df:
id | first_name | last_name | info_1 | info_2
---|------------|-----------|--------|-------
 1 |  Hillary   |  Clinton  |    2   |  3
 1 |  Hillary   |  Clinton  |    10  |  2
 2 |  Donald    |  Trump    |    5   |  6
 2 |  Donald    |  Trump    |    3   |  8
 1 |  Hillary   |  Clinton  |    9   |  5
 3 |  Bernie    |  Sanders  |    5   |  0
 2 |  Donald    |  Trump    |    4   |  9
 3 |  Bernie    |  Sanders  |    24  |  9
 3 |  Bernie    |  Sanders  |    24  |  9

As you can see in the first data frame, "1" is the most frequently occurring id for Hillary Clinton, but "4" appears on the 5th row. So I want to replace every id for Hillary Clinton with "1". The same operation should be applied to all other names (Bernie Sanders and Donald Trump).

As far as I understand, this can be done with "if" and "for", but I couldn't find a clear solution.

Any help would be appreciated!

Joseph

Y.KANG

2 Answers

Use this great custom mode function:

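Mode <- function(x) {
    # distinct values, in order of first appearance
    ux <- unique(x)
    # count how often each distinct value occurs and return the most frequent one
    ux[which.max(tabulate(match(x, ux)))]
}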
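
As a quick check of how this function behaves, here is a minimal sketch on a few small vectors. The input vectors are made up for illustration, and the function is repeated so the snippet runs on its own. Note the tie case: `which.max` returns the first maximum, so the value that appears first in the input wins.

```r
# Same Mode function as above, repeated so this snippet is self-contained
Mode <- function(x) {
    ux <- unique(x)
    ux[which.max(tabulate(match(x, ux)))]
}

# Illustrative inputs (not taken from the real data)
m_num  <- Mode(c(1, 1, 4))         # 1 occurs twice, 4 once
m_chr  <- Mode(c("a", "b", "b"))   # works for character vectors too
m_tie  <- Mode(c(3, 4))            # tie: the first value encountered is returned

print(m_num)   # 1
print(m_chr)   # "b"
print(m_tie)   # 3
```

If ties between ids matter in the real data, it may be worth deciding on an explicit tie-breaking rule rather than relying on input order.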

Then use dplyr:

library(dplyr)
df %>% group_by(last_name) %>% mutate(id = Mode(id))

Source: local data frame [9 x 5]
Groups: last_name [3]

     id first_name last_name info1 info2
  <int>      <chr>     <chr> <int> <int>
1     1    Hillary   Clinton     2     3
2     1    Hillary   Clinton    10     2
3     2     Donald     Trump     5     6
4     2     Donald     Trump     3     7
5     1    Hillary   Clinton     4    11
6     3     Bernie   Sanders     3     2
7     2     Donald     Trump     5     6
8     3     Bernie   Sanders    24     8
9     3     Bernie   Sanders    12    11
Nate
  • Thanks for the useful comment! My real data frame is large, and several people share the same 'last_name', so I will try to distinguish them using both 'first_name' and 'last_name'. But your suggestion really helps me. Thanks again! – Y.KANG Nov 25 '16 at 16:58
  • Use `group_by(...)` with as many bare, comma-separated column names as you need to make a unique grouping – Nate Nov 25 '16 at 17:04
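
A minimal sketch of grouping on both name columns, as suggested in the comments. It assumes the data from the question, with the last column named `info_2` (an assumption; the question's header spells it slightly differently), and it requires the dplyr package to be installed.

```r
library(dplyr)

# Mode function from the answer above, repeated so the snippet runs on its own
Mode <- function(x) {
    ux <- unique(x)
    ux[which.max(tabulate(match(x, ux)))]
}

# Data from the question (column name info_2 is an assumption)
df <- data.frame(
    id         = c(1, 1, 2, 2, 4, 3, 3, 3, 6),
    first_name = c("Hillary", "Hillary", "Donald", "Donald", "Hillary",
                   "Bernie", "Donald", "Bernie", "Bernie"),
    last_name  = c("Clinton", "Clinton", "Trump", "Trump", "Clinton",
                   "Sanders", "Trump", "Sanders", "Sanders"),
    info_1     = c(2, 10, 5, 3, 9, 5, 4, 24, 24),
    info_2     = c(3, 2, 6, 8, 5, 0, 9, 9, 9),
    stringsAsFactors = FALSE
)

# Group on first AND last name so people sharing a last name stay separate,
# then overwrite id with the most frequent id within each group
new_df <- df %>%
    group_by(first_name, last_name) %>%
    mutate(id = Mode(id)) %>%
    ungroup()

print(new_df$id)   # 1 1 2 2 1 3 2 3 3
```

The resulting ids match the `new_df` table shown in the question. The `ungroup()` call is optional here; it just avoids carrying the grouping into later steps.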

This can be achieved by using factor on the last name:

df$id <- as.integer(factor(df$last_name, levels=c("Clinton", "Trump", "Sanders")))


df
  id first_name last_name info1 info2
1  1    Hillary   Clinton     2     3
2  1    Hillary   Clinton    10     2
3  2     Donald     Trump     5     6
4  2     Donald     Trump     3     7
5  1    Hillary   Clinton     4    11
6  3     Bernie   Sanders     3     2
7  2     Donald     Trump     5     6
8  3     Bernie   Sanders    24     8
9  3     Bernie   Sanders    12    11

To change the ID order, simply change the order that you feed to the levels argument of factor.

lmo
  • Thanks for the comment! My data is large and contains almost 1,000 different last names. But I will try to use factor as well. Thanks for the help! – Y.KANG Nov 25 '16 at 17:03
  • You can get a unique list of the names using `levels(df$last_name)` if the column is a factor (or `unique(df$last_name)` if it is a character vector). This can easily be sorted alphabetically. If you want the most common names first, you could use `table` and sort on the frequencies. – lmo Nov 25 '16 at 17:06
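
A minimal sketch of the `table`-and-sort idea from the comment above, using a short made-up vector of last names (hypothetical, not the real data). With many names, the levels can be derived from the data instead of typed by hand. Note that this numbers names by frequency rank; it does not recover the most common existing id for each person.

```r
# Hypothetical sample of last names (illustration only)
last_name <- c("Trump", "Clinton", "Sanders", "Trump", "Clinton", "Trump")

# Count each name, then order the names from most to least frequent
freq <- sort(table(last_name), decreasing = TRUE)
lv   <- names(freq)

# Use that ordering as the factor levels, so the most common name gets id 1
id <- as.integer(factor(last_name, levels = lv))

print(lv)   # "Trump" "Clinton" "Sanders"
print(id)   # 1 2 3 1 2 1
```

For an alphabetical ordering instead, `sort(unique(last_name))` could be passed as the levels.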