3

I have a dataframe where I would like to concatenate certain columns.

My issue is that the text in these columns may or may not contain duplicate information. I would like to strip out the duplicates in order to retain only the relevant information.

For example, if I had a data frame such as:

  Animal1         Animal2        Label  
1 cat dog         dolphin        19
2 dog cat         cat            72
3 pilchard 26     koala          26
4 newt bat 81     bat            81

You can see that in row 2, 'cat' is contained in both columns 'Animal1' and 'Animal2'. In row 3, the number 26 is in both column 'Animal1' and 'Label'. Whereas in row 4, information that is in columns 'Animal2' and 'Label' are already contained in order in 'Animal1'.

So by using the paste function I can concatenate the columns...

data1 <- paste(data$Animal1, data$Animal2, data$Label, sep = " ")

However, I haven't managed yet to remove duplicates. The output I'm getting is of course just from my concatenation:

  Output1
1 cat dog dolphin 19
2 dog cat cat 72
3 pilchard 26 koala 26
4 newt bat 81 bat 81

Row 1 is fine, but the other rows contain duplicates as described above.

The output I would desire is:

  Output1
1 cat dog dolphin 19
2 dog cat 72
3 pilchard koala 26
4 newt bat 81

I tried removing duplicates after concatenating. I know that within a string you can do something like the example below (e.g. Removing duplicate words in a string in R).

d <- unlist(strsplit(data1, split=" "))
paste(d[-which(duplicated(d))], collapse = ' ')

This did work for me when I was just using a string but I couldn't apply it to the whole column as I received an error 'unexpected symbol' referring to the square brackets.

I have seen that there is also the unique() function e.g. Remove Duplicated String in a Row, Deleting reversed duplicates with R

reduce_row = function(i) {
  split = strsplit(i, split=", ")[[1]]
  paste(unique(split), collapse = ", ") 
}
data1$v2 = apply(data1, 1, reduce_row)

I tried to use these examples, but as yet have not been successful.

Any assistance would be very much appreciated.

Community
  • 1
  • 1
K_B1
  • 93
  • 1
  • 8

1 Answers1

3

After you've done data1 <- paste(data$Animal1, data$Animal2, data$Label, sep = " ") :

data.frame(Output1 = vapply(strsplit(data1, " +"), function(x) paste(unique(x), collapse = " "), character(1)))
#              Output1
# 1 cat dog dolphin 19
# 2         dog cat 72
# 3  pilchard 26 koala
# 4        newt bat 81
Aurèle
  • 12,545
  • 1
  • 31
  • 49
  • Hi, thanks for your quick reply, much appreciated. I copied your code but get the error Error in strsplit(data1, " +") : non-character argument. If I then added `data1 <- as.character(data1)`, then tried to run the code, it changes to: ` Output1` `1 c(1, 2, 4, 3)` – K_B1 Mar 16 '17 at 12:29
  • Have you overwritten `data1` after doing `data1 <- paste(data$Animal1, data$Animal2, data$Label, sep = " ")`? – Aurèle Mar 16 '17 at 15:16
  • Not initially, no. I did try after receiving the error, but it still didn't work. – K_B1 Mar 16 '17 at 16:23
  • What does `data1` look like then? – Aurèle Mar 16 '17 at 16:26
  • Data 1 looks like my second example of a dataframe where Output1 is the header (folowing line 10 and before the example of my preferred output). If I changed it to character it just changes to one row containing "c(1, 2, 4, 3)" – K_B1 Mar 16 '17 at 16:59
  • I initially read the data in like `data <- read.csv("test.csv", header = TRUE, sep= ",")` if that makes any difference, but other than that, the rest of the code I have used is as above. – K_B1 Mar 16 '17 at 17:03
  • Is `data1` not the direct result of `paste(data$Animal1, ........`? Have you converted it to a dataframe? Could you then try with `data1$Output1` instead of `data1`? – Aurèle Mar 16 '17 at 17:04
  • Yes it is the direct result of that. Have just tried your suggestion and am getting the same error `Error in strsplit(data1$Output1, " +") : non-character argument` – K_B1 Mar 16 '17 at 17:32
  • `Factor w/ 4 levels "cat dog dolphin 19",..: 1 2 4 3` – K_B1 Mar 17 '17 at 09:58
  • Ok then I guess you need: `strsplit(as.character(data1$Output1), " +")` – Aurèle Mar 17 '17 at 10:21
  • 1
    That works, I then get something like this (I'm just showing you the first row of the output): `[[1]]` `[1] "cat" "dog" "dolphin" "19" ` However, if I proceed to the next step, I still get the error I mentioned before. I then took out the `strsplit` to see if that made a difference: `data.frame(Output1 = vapply(function(x) paste(unique(x), collapse = " "), character(1)))` and I get the error `Error in get(as.character(FUN), mode = "function", envir = envir) : invalid first argument` – K_B1 Mar 17 '17 at 10:32
  • 1
    Sorry, I'm getting lost with all those errors. Which is it already? What does `vapply(strsplit(as.character(data1$Output1), " +"), function(x) paste(unique(x), collapse = " "), character(1))` give? – Aurèle Mar 17 '17 at 10:44
  • 1
    Ah, my apologies. My explanation probably wasn't clear enough. But yes, this works! I get the same as your example after this. Thank you very much for your assistance and patience. Deeply appreciated. Just out of curiosity, is there a way to get it in the correct order e.g. `pilchard koala 26` instead of `pilchard 26 koala`? Maybe I'd have too sort()? Or is this just because the unique function takes the first instance of the word/ number. – K_B1 Mar 17 '17 at 10:53
  • You may wrap `unique(x)` in `sort(unique(x), decreasing = TRUE)` to output numbers last – Aurèle Mar 17 '17 at 15:13