iterate replacing elements with the most frequently occurring element in a data frame

Question

I am a novice at R programming and am trying to use it for my data handling.

I am trying to create a new data frame by replacing some elements with the most frequently occurring element in my data frame.

My original data frame looks like this:

df:
id | first_name | last_name | info_1 |infor_2
---|------------|-----------|--------|-------
 1 |  Hillary   |  Clinton  |    2   |  3
 1 |  Hillary   |  Clinton  |    10  |  2
 2 |  Donald    |  Trump    |    5   |  6
 2 |  Donald    |  Trump    |    3   |  8
 4 |  Hillary   |  Clinton  |    9   |  5
 3 |  Bernie    |  Sanders  |    5   |  0
 3 |  Donald    |  Trump    |    4   |  9
 3 |  Bernie    |  Sanders  |    24  |  9
 6 |  Bernie    |  Sanders  |    24  |  9

The new data frame should look like this:

new_df:
id | first_name | last_name | info_1 |infor_2
---|------------|-----------|--------|-------
 1 |  Hillary   |  Clinton  |    2   |  3
 1 |  Hillary   |  Clinton  |    10  |  2
 2 |  Donald    |  Trump    |    5   |  6
 2 |  Donald    |  Trump    |    3   |  8
 1 |  Hillary   |  Clinton  |    9   |  5
 3 |  Bernie    |  Sanders  |    5   |  0
 2 |  Donald    |  Trump    |    4   |  9
 3 |  Bernie    |  Sanders  |    24  |  9
 3 |  Bernie    |  Sanders  |    24  |  9

As you can see in the first data frame, "1" is the most frequently occurring id for Hillary Clinton, but "4" appears on the 5th row. So, I want to replace every id for Hillary Clinton with "1". The same operation should be applied to all the other names (Bernie Sanders and Donald Trump).

As I understand it, this can be done with "if" and "for", but I couldn't find a clear solution.

Any help would be appreciated!

Joseph

It was just edited, nice. Pictures don't help, data does. – Nate, Nov 25 '16 at 16:29
Can you provide sample data with `dput`? – Kota Mori, Nov 25 '16 at 16:29
Try posting the code you have already tried. – ste-fu, Nov 25 '16 at 16:29
I edited the data table. I am sorry for the complicated format. – Y.KANG, Nov 25 '16 at 16:50

score 0 · Answer 1 · edited May 23 '17 at 12:30

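Using this great custom mode function:

# Return the most frequent value of x (the first one encountered wins ties)
Mode <- function(x) {
    ux <- unique(x)
    ux[which.max(tabulate(match(x, ux)))]
}

Then, with dplyr, group by name and overwrite each group's id with its mode:

library(dplyr)
df %>% group_by(last_name) %>% mutate(id = Mode(id))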

Source: local data frame [9 x 5]
Groups: last_name [3]

     id first_name last_name info1 info2
  <int>      <chr>     <chr> <int> <int>
1     1    Hillary   Clinton     2     3
2     1    Hillary   Clinton    10     2
3     2     Donald     Trump     5     6
4     2     Donald     Trump     3     7
5     1    Hillary   Clinton     4    11
6     3     Bernie   Sanders     3     2
7     2     Donald     Trump     5     6
8     3     Bernie   Sanders    24     8
9     3     Bernie   Sanders    12    11
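For readers without dplyr, the same idea can be sketched in base R with `ave()`. This is only a minimal sketch: the data frame below is transcribed from the question's first table (column names as spelled there), and the `key` helper variable is an illustrative name, not something from the original answer. Note that `Mode()` returns the first value encountered when there is a tie.

```r
# Sample data transcribed from the question's first table
df <- data.frame(
  id         = c(1, 1, 2, 2, 4, 3, 3, 3, 6),
  first_name = c("Hillary", "Hillary", "Donald", "Donald", "Hillary",
                 "Bernie", "Donald", "Bernie", "Bernie"),
  last_name  = c("Clinton", "Clinton", "Trump", "Trump", "Clinton",
                 "Sanders", "Trump", "Sanders", "Sanders"),
  info_1     = c(2, 10, 5, 3, 9, 5, 4, 24, 24),
  infor_2    = c(3, 2, 6, 8, 5, 0, 9, 9, 9),
  stringsAsFactors = FALSE
)

# Most frequent value of x; on ties, the value seen first wins
Mode <- function(x) {
  ux <- unique(x)
  ux[which.max(tabulate(match(x, ux)))]
}

# One grouping key per person (hypothetical helper variable)
key <- paste(df$first_name, df$last_name)

# ave() applies Mode within each group and returns a vector aligned with df
new_df <- df
new_df$id <- ave(df$id, key, FUN = Mode)

print(new_df)
```

No explicit "if" or "for" is needed here, because `ave()` handles the per-group iteration.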

answered Nov 25 '16 at 16:33 by Nate

Thanks for the useful comment! My real data frame is big, and several people in it share the same 'last_name', so I will try to distinguish them using both 'first_name' and 'last_name'. Your suggestion really helped me. Thanks again! – Y.KANG Nov 25 '16 at 16:58
Use `group_by(...)` with as many bare, comma-separated column names as you need to make a unique grouping. – Nate Nov 25 '16 at 17:04
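The multi-column grouping suggested in the comment above might look like the following sketch. The rows are hypothetical (a second "Clinton" is invented purely to show why `last_name` alone can be too coarse), and `by_last` / `by_both` are illustrative names.

```r
library(dplyr)

# Most frequent value of x; on ties, the value seen first wins
Mode <- function(x) {
  ux <- unique(x)
  ux[which.max(tabulate(match(x, ux)))]
}

# Hypothetical data: two different people share the same last name
df <- data.frame(
  id         = c(1, 1, 4, 7, 7, 7, 7),
  first_name = c("Hillary", "Hillary", "Hillary", "Bill", "Bill", "Bill", "Bill"),
  last_name  = rep("Clinton", 7),
  stringsAsFactors = FALSE
)

# Grouping by last_name only merges both people into one group
by_last <- df %>%
  group_by(last_name) %>%
  mutate(id = Mode(id)) %>%
  ungroup()

# Grouping by both columns keeps each person's own most frequent id
by_both <- df %>%
  group_by(first_name, last_name) %>%
  mutate(id = Mode(id)) %>%
  ungroup()

print(by_last$id)
print(by_both$id)
```

With this made-up data, grouping by `last_name` alone assigns id 7 to every row, while grouping by both columns leaves Hillary at 1 and Bill at 7.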

score 0 · Answer 2 · answered Nov 25 '16 at 16:36

This can be achieved by using factor on the last name:

df$id <- as.integer(factor(df$last_name, levels=c("Clinton", "Trump", "Sanders")))


df
  id first_name last_name info1 info2
1  1    Hillary   Clinton     2     3
2  1    Hillary   Clinton    10     2
3  2     Donald     Trump     5     6
4  2     Donald     Trump     3     7
5  1    Hillary   Clinton     4    11
6  3     Bernie   Sanders     3     2
7  2     Donald     Trump     5     6
8  3     Bernie   Sanders    24     8
9  3     Bernie   Sanders    12    11

To change the ID order, simply change the order that you feed to the levels argument of factor.

answered Nov 25 '16 at 16:36 by lmo

Thanks for the comment! My data is big and contains almost 1,000 different last names, but I will try to use factor as well. Thanks for the help! – Y.KANG Nov 25 '16 at 17:03
You can get a unique list of the names using `levels(df$last_name)` (this requires `last_name` to be a factor). This can easily be sorted alphabetically. If you wanted the most common names first, you could use `table` and sort on the frequencies. – lmo Nov 25 '16 at 17:06
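With around 1,000 names, typing the levels by hand is impractical, so the `table`-and-sort idea from the comment above could be sketched as follows. The `last_name` vector here is hypothetical sample data, and `freq` and `lev` are illustrative names. One caveat, stated as an assumption about the intent: this approach numbers the names 1, 2, 3, ... in level order, so it generates fresh ids rather than keeping each person's original most frequent id.

```r
# Hypothetical sample of last names with distinct frequencies
last_name <- c("Trump", "Clinton", "Clinton", "Sanders", "Clinton", "Trump")

# Count each name and order the counts from most to least common
freq <- sort(table(last_name), decreasing = TRUE)

# The names of the sorted table become the factor levels
lev <- names(freq)

# Integer codes follow level order: the most common name gets 1
id <- as.integer(factor(last_name, levels = lev))

print(lev)
print(id)

# Alphabetical ordering instead of frequency ordering
lev_alpha <- sort(unique(last_name))
print(lev_alpha)
```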
