I need some help with R data frames. Here is the data frame:
group col1   col2 name
1     dog    40   canidae
1     dog    40   canidae
1     dog    40   canidae
1     dog    40   canidae
1     dog    40
1     dog    40   canidae
1     dog    40   canidae
2     frog   85   dendrobatidae
2     frog   89   leptodactylidae
2     frog   89   leptodactylidae
2     frog   82   leptodactylidae
2     frog   89
2     frog   81
2     frog   89   dendrobatidae
3     horse  87   equidae1
3     donkey 76   equidae2
3     zebra  67   equidae3
4     bird   54   psittacidae
4     bird   56
4     bird   34
5     bear   67
5     bear   54
What I would like to do is add a column "consensus_name" and get:
group col1   col2 name            consensus_name
1     dog    40   canidae         canidae
1     dog    40   canidae         canidae
1     dog    40   canidae         canidae
1     dog    40   canidae         canidae
1     dog    40                   canidae
1     dog    40   canidae         canidae
1     dog    40   canidae         canidae
2     frog   85   dendrobatidae   leptodactylidae
2     frog   89   leptodactylidae leptodactylidae
2     frog   89   leptodactylidae leptodactylidae
2     frog   82   leptodactylidae leptodactylidae
2     frog   89                   leptodactylidae
2     frog   81                   leptodactylidae
2     frog   89   dendrobatidae   leptodactylidae
3     horse  87   equidae1        equidae3
3     donkey 76   equidae2        equidae3
3     zebra  67   equidae3        equidae3
4     bird   54   psittacidae     psittacidae
4     bird   56                   psittacidae
4     bird   34                   psittacidae
5     bear   67                   NA
5     bear   54                   NA
To fill this new column, for each group I take the name that is most representative of the group.
For group 1 there are 6 rows with the name 'canidae' and one with no name, so I write 'canidae' in the column consensus_name for every row.
For group 2 there are 2 rows with the name 'dendrobatidae', 2 with no name and 3 rows with the name 'leptodactylidae', so I write 'leptodactylidae' in the column consensus_name for every row.
For group 3 there are 3 rows with different names; because there is no consensus, I take the name that has the lowest col2 number, so I write 'equidae3' in the column consensus_name.
For group 4 only one row has any information, so that name is the consensus name of the group, and I write 'psittacidae' in the column consensus_name.
For group 5 there is no information at all, so I just write NA in the consensus_name column.
Does anyone have an idea how to do this in R? Thanks for your help :)
Here is the df as dput output (it covers groups 1 to 4 only; the two group 5 rows shown above are not included):
structure(list(group = c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L,
2L, 2L, 2L, 2L, 2L, 3L, 3L, 3L, 4L, 4L, 4L), col1 = structure(c(2L,
2L, 2L, 2L, 2L, 2L, 2L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 5L, 3L, 6L,
1L, 1L, 1L), .Label = c("bird", "dog", "donkey", "frog", "horse",
"zebra"), class = "factor"), col2 = c(40L, 40L, 40L, 40L, 40L,
40L, 40L, 85L, 89L, 89L, 82L, 89L, 81L, 89L, 87L, 76L, 67L, 54L,
56L, 34L), name = structure(c(2L, 2L, 2L, 2L, 1L, 2L, 2L, 3L,
7L, 7L, 7L, 1L, 1L, 3L, 4L, 5L, 6L, 8L, 1L, 1L), .Label = c("",
"canidae", "dendrobatidae", "equidae1", "equidae2", "equidae3",
"leptodactylidae", "psittacidae"), class = "factor")), class = "data.frame", row.names = c(NA,
-20L))
The real one has around 50,000 rows.
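To make the rule described above concrete, here is a rough base R sketch of one way it could be written. It is only an illustration under some assumptions: the helper name consensus_one is made up for this example, empty strings are treated as "no name", and any tie for the most frequent name (not only the case where every name is different) is broken by taking the name on the row with the lowest col2. The data frame is rebuilt by hand from the table above so that group 5 is included.

```r
# Rebuild the example data by hand (all 22 rows, including group 5).
df <- data.frame(
  group = rep(1:5, times = c(7, 7, 3, 3, 2)),
  col1  = c(rep("dog", 7), rep("frog", 7), "horse", "donkey", "zebra",
            rep("bird", 3), rep("bear", 2)),
  col2  = c(rep(40, 7), 85, 89, 89, 82, 89, 81, 89, 87, 76, 67,
            54, 56, 34, 67, 54),
  name  = c("canidae", "canidae", "canidae", "canidae", "", "canidae", "canidae",
            "dendrobatidae", "leptodactylidae", "leptodactylidae",
            "leptodactylidae", "", "", "dendrobatidae",
            "equidae1", "equidae2", "equidae3",
            "psittacidae", "", "", "", ""),
  stringsAsFactors = FALSE
)

# Hypothetical helper (not from the post): choose one consensus name
# for a single group, given that group's name and col2 vectors.
consensus_one <- function(name, col2) {
  name <- as.character(name)              # also works if name is a factor
  keep <- !is.na(name) & name != ""       # rows that actually carry a name
  if (!any(keep)) return(NA_character_)   # no information at all -> NA
  counts <- table(name[keep])
  top <- names(counts)[as.vector(counts) == max(counts)]
  if (length(top) == 1L) return(top)      # a single most frequent name
  # Assumption: on a tie, take the tied name on the row with the lowest col2.
  tied <- keep & name %in% top
  name[tied][which.min(col2[tied])]
}

# One consensus name per group, then copied back onto every row of the group.
per_group <- sapply(split(df, df$group),
                    function(g) consensus_one(g$name, g$col2))
df$consensus_name <- unname(per_group[as.character(df$group)])
```

With around 50,000 rows, splitting by group like this should still be workable, although I have not timed it on data of that size.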