
This is my first question here at Stack Overflow. I have a data frame in R with over a hundred columns that is supposed to have duplicates. I can't use unique() because I only want to remove row-adjacent duplicates in each column.

L = list(c("AL", "AL", "AI", "AH", "BK", "CD", "CE", "BT", "BP", 
"BD", "BI", "AL"), c("AL", "AL", "AI", "AH", "BK", "AU", "BK", 
"CD", "V", "CE", "CE"), c("AL", "AL", "AI", "AH", "AU", "BK", 
"BQ"))
do.call(cbind, lapply(L, `length<-`, max(lengths(L))))

song 1  song 2  song 3
AL  AL  AL
AL  AL  AL
AI  AI  AI
AH  AH  AH
BK  BK  AU
CD  AU  BK
CE  BK  BQ
BT  CD  
BP  V   
BD  CE  
BI  CE  
AL      

This is the result I want:

song 1  song 2  song 3
AL  AL  AL
AI  AI  AI
AH  AH  AH
BK  BK  AU
CD  AU  BK
CE  BK  BQ
BT  CD  
BP  V   
BD  CE  
BI      
AL      

I've seen previous answers that seem to work just fine for a single column.

The solution was

df = df[with(df, c(x[-1]!= x[-nrow(df)], TRUE)),]

I've seen rle solutions, but they don't work. Considering that the columns in my data frame have different lengths, I would like to know if there is a way to loop through all the columns.

Frank
  • Columns of a `data.frame` don't have different length – HubertL Sep 28 '16 at 22:42
  • It's not clear to me what you're asking since your example has one column but you are interested in some extension to multiple columns when identifying dupes. If your task is fairly standard, `rleid` from the data.table package may help. "looping through columns" sounds like something else, though... – Frank Sep 28 '16 at 22:43
  • Also, since you're concerned about *only* removing row-adjacent duplicates and not using `unique`, a good example would have some duplicates that are not row-adjacent so we can verify that they stay. – Gregor Thomas Sep 28 '16 at 23:06
  • You should have a look at [how to make a reproducible example in R](http://stackoverflow.com/q/5963269/903061) and update your question to respond to the comments above. Please pay special attention to reproducibly sharing your data object, using `dput()` or some short code to simulate. As Hubert says, data frames in R have two requirements: (1) classes are defined for each column, and (2) all columns have the same length. So when you say "*the columns in my data frame have different lengths*" it sounds very wrong, which means we need to see the object. Perhaps you just have a list? – Gregor Thomas Sep 28 '16 at 23:10
  • @Frank, "looping through columns" is probably not exactly what I meant to say; English is not my native language. I was trying to say that I was interested in some extension to multiple columns. Gregor, you're right: each column was originally in a list, and I used `do.call(cbind, x)` – Juliana Benitez Sep 28 '16 at 23:26
  • If your columns have different lengths, leave it as a list. Otherwise as soon as you bind them together the shorter ones will be repeated. See, e.g., `do.call(cbind, list(x = 1:3, y = 1:2, z = 1))`. – Gregor Thomas Sep 28 '16 at 23:28

1 Answer


Let's say you have a list like this:

songs
# $song_1
# [1] "AL" "AL" "AI" "AH" "BK" "CD" "CE" "BT" "BP" "BD" "BI" "AL"
# 
# $song_2
# [1] "AL" "AL" "AI" "AH" "BK" "AU" "BK" "CD" "V"  "CE" "CE"
# 
# $song_3
# [1] "AL" "AL" "AI" "AH" "AU" "BK" "BQ"

Shared reproducibly with dput:

songs = structure(list(song_1 = c("AL", "AL", "AI", "AH", "BK", "CD", 
"CE", "BT", "BP", "BD", "BI", "AL"), song_2 = c("AL", "AL", "AI", 
"AH", "BK", "AU", "BK", "CD", "V", "CE", "CE"), song_3 = c("AL", 
"AL", "AI", "AH", "AU", "BK", "BQ")), .Names = c("song_1", "song_2", 
"song_3"))

You can de-dupe adjacent elements in a single list item similarly to the data frame method you have in your question. The comparison is one element shorter than the vector, so pad it with a trailing TRUE; otherwise R recycles the shorter logical index and the last element can be dropped by mistake.

with(songs, song_1[c(song_1[-1] != song_1[-length(song_1)], TRUE)])
# [1] "AL" "AI" "AH" "BK" "CD" "CE" "BT" "BP" "BD" "BI" "AL"

To do this to all items in the list, we use lapply with an anonymous function:

lapply(songs, function(s) s[c(s[-1] != s[-length(s)], TRUE)])
# $song_1
# [1] "AL" "AI" "AH" "BK" "CD" "CE" "BT" "BP" "BD" "BI" "AL"
# 
# $song_2
# [1] "AL" "AI" "AH" "BK" "AU" "BK" "CD" "V"  "CE"
# 
# $song_3
# [1] "AL" "AI" "AH" "AU" "BK" "BQ"
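
Base R's rle() is another option for collapsing adjacent duplicates, since its values component holds one entry per run. This is only a sketch under the assumption that each list item is a plain character vector; rle() rejects factors, so those would need as.character() first. The name deduped_rle is just a placeholder.

```r
# Rebuild the example list so this snippet runs on its own
songs <- list(
  song_1 = c("AL", "AL", "AI", "AH", "BK", "CD", "CE", "BT", "BP", "BD", "BI", "AL"),
  song_2 = c("AL", "AL", "AI", "AH", "BK", "AU", "BK", "CD", "V", "CE", "CE"),
  song_3 = c("AL", "AL", "AI", "AH", "AU", "BK", "BQ")
)

# rle() splits a vector into runs of identical neighbours;
# $values keeps one entry per run, so non-adjacent repeats survive
deduped_rle <- lapply(songs, function(s) rle(s)$values)

deduped_rle$song_3
```

Note that the "AL" at the end of song_1 stays, because it is not next to the "AL" values at the start.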

You can, of course, assign the results of lapply to a new object or overwrite the existing object.
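
If you do want a rectangular object afterwards, here is a rough sketch of storing the result and padding the shorter items with NA, using the same `length<-` trick as in the question. The names drop_adjacent_dupes, deduped, and padded are made up for illustration, and it assumes plain character vectors.

```r
# Rebuild the example list so this snippet runs on its own
songs <- list(
  song_1 = c("AL", "AL", "AI", "AH", "BK", "CD", "CE", "BT", "BP", "BD", "BI", "AL"),
  song_2 = c("AL", "AL", "AI", "AH", "BK", "AU", "BK", "CD", "V", "CE", "CE"),
  song_3 = c("AL", "AL", "AI", "AH", "AU", "BK", "BQ")
)

# Hypothetical helper: drop an element when the next one is identical.
# The trailing TRUE makes the index as long as s, so the last element is kept.
drop_adjacent_dupes <- function(s) s[c(s[-1] != s[-length(s)], TRUE)]

# Store the de-duplicated list in a new object
deduped <- lapply(songs, drop_adjacent_dupes)

# Pad every item with NA up to the longest length, then bind as columns
n_max  <- max(lengths(deduped))
padded <- do.call(cbind, lapply(deduped, `length<-`, n_max))

dim(padded)
```

The padding gives a character matrix with NA in the unused cells, instead of recycled values.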


Note that your data took a fair bit of work to get into R because of how you posted it. Next time, please use dput() or share code to create simulated data.
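
As a small illustration of what that means, here is a sketch with a made-up object called example_list. dput() prints R code that recreates the object, so anyone can paste it into their session; the round trip through deparse() and parse() below is only there to show that nothing is lost.

```r
# A tiny stand-in for your real data (hypothetical values)
example_list <- list(song_1 = c("AL", "AL", "AI"), song_2 = c("AL", "AI"))

# Prints code that recreates example_list; paste that output into your question
dput(example_list)

# deparse() produces the same code as text; evaluating it rebuilds the object
recreated <- eval(parse(text = deparse(example_list)))
identical(example_list, recreated)
```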

Gregor Thomas
  • I got back to this function and unfortunately it does not only delete the adjacent duplicates; it also deletes the last element even if it's not repeated. However, the deleted last element still appears among the levels of the list – Juliana Benitez Aug 28 '17 at 13:33