I have a dataset with 4 columns containing names, where the number and order of names differ between columns. A column can also contain the same name more than once. It looks as follows:
df <- data.frame(x1 = c("Ben", "Alex", "Tim", "Lisa", "MJ", "NA", "NA", "NA", "NA"),
                 x2 = c("Ben", "Paul", "Tim", "Linda", "Alex", "MJ", "Lisa", "Ken", "NA"),
                 x3 = c("Tomas", "Alex", "Ben", "Paul", "MJ", "Tim", "Ben", "Alex", "Linda"),
                 x4 = c("Ben", "Alex", "Tim", "Lisa", "MJ", "Ben", "Barbara", "NA", "NA"))
First, I need to extract the unique names in the dataset. I did that using the following code:
u <- as.vector(unique(unlist(df)))
Second, I need to find the names that can be found in all 4 columns (class A names), in 3 out of 4 columns (class B names) and in 2 out of 4 columns (class C names).
Here is where I get stuck. I can only extract the names that are contained in all 4 columns using:
n <- ifelse(u %in% df$x1 & u %in% df$x2 & u %in% df$x3 &
            u %in% df$x4, "A", "B")
So, for example, Ben would be a class A name because it can be found in all 4 columns, and Lisa would be a class B name because it can be found in only 3 of the 4 columns:
Name Class
Ben A
Lisa B
Is there a nicer way to classify the unique names according to the number of columns they appear in, and how can it be done for class B and class C names?
Thanks in advance!
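
One possible approach, as a minimal sketch rather than a definitive answer: count for each unique name how many columns contain it at least once, then map that count to a class. The sketch assumes that the string "NA" is a placeholder for an empty cell and not a real name, and that names found in only one column should stay unclassified (NA). The names present, n_cols, cls and result are made up for illustration.

```r
# Sample data from the question; the string "NA" marks an empty cell.
df <- data.frame(
  x1 = c("Ben", "Alex", "Tim", "Lisa", "MJ", "NA", "NA", "NA", "NA"),
  x2 = c("Ben", "Paul", "Tim", "Linda", "Alex", "MJ", "Lisa", "Ken", "NA"),
  x3 = c("Tomas", "Alex", "Ben", "Paul", "MJ", "Tim", "Ben", "Alex", "Linda"),
  x4 = c("Ben", "Alex", "Tim", "Lisa", "MJ", "Ben", "Barbara", "NA", "NA"),
  stringsAsFactors = FALSE
)

# Unique names, dropping the "NA" placeholder
# (assumption: "NA" is not a real name).
u <- setdiff(unique(unlist(df, use.names = FALSE)), "NA")

# Logical matrix with one row per name and one column per data column:
# TRUE where the name occurs at least once in that column.
present <- sapply(df, function(col) u %in% col)

# Number of columns each name appears in. Duplicates within a column
# count only once because %in% tests membership, not frequency.
n_cols <- rowSums(present)

# Map the count to a class: 4 -> "A", 3 -> "B", 2 -> "C", 1 -> NA
# (assumption: names found in a single column get no class).
cls <- c(NA, "C", "B", "A")[n_cols]

result <- data.frame(Name = u, Columns = n_cols, Class = cls,
                     stringsAsFactors = FALSE)
print(result)
```

With the sample data this should put Ben in class A and Lisa in class B, as in the example above, and Paul and Linda in class C. Because the class is looked up from the column count, the same code would cover more columns by extending the lookup vector.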