2

I have a dataset with 4 columns containing names, where the number of names and the order of names differ between columns. Some columns can also contain the same name twice or more. It looks like follows:

df<- data.frame(x1=c("Ben","Alex","Tim", "Lisa", "MJ","NA", "NA","NA","NA"), 
x2=c("Ben","Paul","Tim", "Linda", "Alex", "MJ", "Lisa", "Ken","NA"), 
x3=c("Tomas","Alex","Ben", "Paul", "MJ", "Tim", "Ben", "Alex", "Linda"), 
x4=c("Ben","Alex","Tim", "Lisa", "MJ", "Ben", "Barbara","NA", "NA"))

Now I have to first extract the unique names within the dataset. I did that using the following code:

u<- as.vector(unique(unlist(df)))

Second, I need to find the names that can be found in all 4 columns (class A names), in 3 out of 4 columns (class B names) and in 2 out of 4 columns (class C names).

Here is where I get stuck. I can only extract the names that are contained in all 4 columns using:

n<- ifelse(u%in%df$x1 & u%in%df$x2 & u%in%df$x3 & 
               u%in%df$x4", A, B)

So, e.g., Ben would be a A class name because it can be found in all 4 columns and Lisa would be a B class name because it can only be found in 3 out of 4 columns.

Name Class
Ben    A
Lisa   B

Is there a nicer way to classify the unique names according to the number of columns they can be found in and how can it be done for B and C class names?

Thanks in advance!

ZayzayR
  • 183
  • 9
  • Sorry for that! I added some "NA" rows for the shorter columns. So it resembles more the original data now. – ZayzayR Aug 04 '20 at 13:23
  • https://stackoverflow.com/questions/63235393/same-calculations-over-different-datasets/63241291#63241291 You created a topic with a similar question. See the answer. Does it fit? – Yuriy Saraykin Aug 04 '20 at 13:47
  • `table(unlist(lapply(df, levels)))` (if these are factors) or `table(unlist(lapply(df, unique)))` (if those are characters) is another option – David Arenburg Aug 04 '20 at 13:53

3 Answers3

3

Here is an idea via base R. We loop over the unique elements and take the rowSums when the dataframe equals with each name. The maximum result is your expected output

sapply(unique(c(t(df))), function(i) max(rowSums(df == i)))

#  Ben Tomas  Alex  Paul   Tim  Lisa Linda    MJ 
#    3     1     3     1     3     2     1     3 
Sotos
  • 51,121
  • 6
  • 32
  • 66
  • Thanks you for the answer! But I need the to know in how many columns the names can be found at least once. So for Ben it should be a 4 because the name Ben can be found in all 4 columns (at least once). Do you know a solution for that? – ZayzayR Aug 04 '20 at 13:27
2

You can get the data in long format and for each name find how many unique column it occurs :

library(dplyr)

df %>%
  tidyr::pivot_longer(cols = everything(), values_drop_na = TRUE) %>%
  group_by(value) %>%
  summarise(count = n_distinct(name))

#   value   count
#   <chr>   <int>
# 1 Alex        4
# 2 Barbara     1
# 3 Ben         4
# 4 Ken         1
# 5 Linda       2
# 6 Lisa        3
# 7 MJ          4
# 8 NA          3
# 9 Paul        2
#10 Tim         4
#11 Tomas       1

Here you get "NA" in the output because it is a string. If your data has real NA's it will be dropped because of values_drop_na = TRUE.

Ronak Shah
  • 377,200
  • 20
  • 156
  • 213
  • Hey, thanks a lot! This is the perfect solution to my problem! Do you think that I can make a for loop out of that and apply it to several datasets, let's say 5 (df1, df2, df3, df4, df5) at once? Do you also know a solution for that? – ZayzayR Aug 04 '20 at 13:56
  • 1
    Put the above code in a function (let's say `fun`). Then put all dataframes in a list, `lst_data <- list(df1, df2...)` and use `map(lst_data, fun)`. – Ronak Shah Aug 04 '20 at 14:02
2

A simple base R option using aggregate + stack

aggregate(.~values,unique(stack(df)),length)

such that

> aggregate(.~values,unique(stack(df)),length)
    values ind
1     Alex   4
2  Barbara   1
3      Ben   4
4      Ken   1
5    Linda   2
6     Lisa   3
7       MJ   4
8       NA   3
9     Paul   2
10     Tim   4
11   Tomas   1
ThomasIsCoding
  • 96,636
  • 9
  • 24
  • 81