I have a data frame in which I would like to concatenate certain columns.
My issue is that the text in these columns may or may not contain duplicate information. I would like to strip out the duplicates in order to retain only the relevant information.
For example, if I had a data frame such as:
  Animal1      Animal2  Label
1 cat dog      dolphin  19
2 dog cat      cat      72
3 pilchard 26  koala    26
4 newt bat 81  bat      81
You can see that in row 2, 'cat' appears in both 'Animal1' and 'Animal2'. In row 3, the number 26 appears in both 'Animal1' and 'Label'. In row 4, the information in 'Animal2' and 'Label' is already contained, in the same order, in 'Animal1'.
Using the paste() function, I can concatenate the columns:
data1 <- paste(data$Animal1, data$Animal2, data$Label, sep = " ")
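To make the example reproducible, here is a minimal sketch that rebuilds the data frame and runs the paste() call. The column values are read off the table above; storing Label as numeric and setting stringsAsFactors = FALSE are assumptions, not something the original data confirms.

```r
# Rebuild the example data (assumed types: character, character, numeric)
data <- data.frame(
  Animal1 = c("cat dog", "dog cat", "pilchard 26", "newt bat 81"),
  Animal2 = c("dolphin", "cat", "koala", "bat"),
  Label   = c(19, 72, 26, 81),
  stringsAsFactors = FALSE
)

# paste() is vectorised, so this yields one concatenated string per row
data1 <- paste(data$Animal1, data$Animal2, data$Label, sep = " ")
print(data1)
```

Note that data1 is a plain character vector with one element per row, not a data frame.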
However, I haven't yet managed to remove the duplicates. The output I'm getting is, of course, just the plain concatenation:
Output1
1 cat dog dolphin 19
2 dog cat cat 72
3 pilchard 26 koala 26
4 newt bat 81 bat 81
Row 1 is fine, but the other rows contain duplicates as described above.
The output I would desire is:
Output1
1 cat dog dolphin 19
2 dog cat 72
3 pilchard koala 26
4 newt bat 81
I tried removing duplicates after concatenating. I know that within a string you can do something like the example below (e.g. Removing duplicate words in a string in R).
d <- unlist(strsplit(data1, split = " "))  # split the string into words
paste(d[!duplicated(d)], collapse = " ")   # keep the first occurrence of each word
This worked when I applied it to a single string, but I couldn't apply it to the whole column: I received an 'unexpected symbol' error referring to the square brackets.
I have seen that there is also the unique() function, e.g. in "Remove Duplicated String in a Row" and "Deleting reversed duplicates with R":
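One likely reason this does not carry over to a whole column: strsplit() on a vector returns a list with one element per string, and unlist() merges those into a single pool of words, so duplicated() then compares words across rows. A minimal sketch of a per-string version follows; dedupe_words and its from_last argument are hypothetical names introduced here for illustration. It indexes with !duplicated(), which, unlike negative indexing via which(), still works when a string has no duplicates. (An 'unexpected symbol' message is a parse error, so it probably stems from how the line was typed rather than from the data.)

```r
# Hypothetical helper: drop repeated words within each string of a character vector
dedupe_words <- function(x, from_last = FALSE) {
  # strsplit() returns a list holding one word vector per input string
  words <- strsplit(x, split = " ", fixed = TRUE)
  vapply(
    words,
    function(w) paste(w[!duplicated(w, fromLast = from_last)], collapse = " "),
    character(1)
  )
}

# The concatenated strings from the question
data1 <- c("cat dog dolphin 19", "dog cat cat 72",
           "pilchard 26 koala 26", "newt bat 81 bat 81")

first_kept <- dedupe_words(data1)                    # keep first occurrence
last_kept  <- dedupe_words(data1, from_last = TRUE)  # keep last occurrence
print(first_kept)
print(last_kept)
```

Keeping the first occurrence gives "pilchard 26 koala" for row 3, while from_last = TRUE gives "pilchard koala 26", which is the form shown in the desired output.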
reduce_row = function(i) {
  split = strsplit(i, split=", ")[[1]]
  paste(unique(split), collapse = ", ")
}
data1$v2 = apply(data1, 1, reduce_row)
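A possible adaptation of this approach to the example data is sketched below. It assumes the function is applied to the original data frame columns rather than to data1 (apply() expects a matrix or data frame, and data1 is a plain character vector), that words are separated by spaces rather than ", ", and that the last occurrence of a repeated word is the one to keep, which is what the desired output suggests. The name cols and the column types are assumptions introduced for illustration.

```r
# Example data rebuilt from the table in the question (types are assumed)
data <- data.frame(
  Animal1 = c("cat dog", "dog cat", "pilchard 26", "newt bat 81"),
  Animal2 = c("dolphin", "cat", "koala", "bat"),
  Label   = c(19, 72, 26, 81),
  stringsAsFactors = FALSE
)

# Receives one row as a character vector (apply() converts the data frame
# to a character matrix), pools its words, and drops earlier repeats
reduce_row <- function(row) {
  # trimws() and the " +" pattern guard against padding added during conversion
  words <- unlist(strsplit(trimws(row), split = " +"))
  paste(words[!duplicated(words, fromLast = TRUE)], collapse = " ")
}

cols <- c("Animal1", "Animal2", "Label")
data$Output1 <- unname(apply(data[, cols], 1, reduce_row))
print(data$Output1)
```

Replacing !duplicated(words, fromLast = TRUE) with unique(words) would keep the first occurrence instead, so row 3 would become "pilchard 26 koala".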
I tried to use these examples, but as yet have not been successful.
Any assistance would be very much appreciated.