8

I have something similar to this:

date        pgm      in.x     logs       out.y
20130514    na       12       j1         12
20131204    z2       03       j1         03
20130516    a01      04       j0         04
20130628    z1       05       j2         05

I noticed that the in and out values are always the same, so I want to delete the out.y column. I have other columns like this, so I want to be able to detect any .y columns that match .x columns and delete them after I do the merge.

Frank
Chayma Atallah

3 Answers

13

If we assume all column redundancies should be removed

no_duplicate <- data_set[!duplicated(as.list(data_set))]

will do the trick.

as.list converts the data.frame to a list of its columns, and duplicated returns a logical vector that is TRUE for every column whose values all duplicate those of a previously seen column.

This does not directly try to compare .x and .y columns, but has the effect of retaining one copy of each duplicated column, which I assume is the main goal. On the other hand, it will also remove any .x columns that are duplicates of another .x column.
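As a quick check, here is a minimal sketch of that idiom on the question's sample table. The data frame is rebuilt by hand with character columns, which is an assumption about the real data.

```r
# Sample data from the question, rebuilt by hand (column types are assumed).
data_set <- data.frame(
  date  = c("20130514", "20131204", "20130516", "20130628"),
  pgm   = c("na", "z2", "a01", "z1"),
  in.x  = c("12", "03", "04", "05"),
  logs  = c("j1", "j1", "j0", "j2"),
  out.y = c("12", "03", "04", "05"),
  stringsAsFactors = FALSE
)

# TRUE for each column that repeats an earlier column value for value.
is_dup <- duplicated(as.list(data_set))

# Keep only the first copy of each distinct column.
no_duplicate <- data_set[!is_dup]
names(no_duplicate)
```

Only out.y is flagged here, because it repeats in.x; the first copy of a repeated column is always the one that is kept.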


If we want to retain all .x columns, even those that are duplicates, a good solution might be to do filtering before the merge. Assuming you have data_x and data_y that will be merged by column "identifier":

data_y_nonredundant <- data_y[!(as.list(data_y) %in% as.list(data_x) & names(data_y)!="identifier")]
data <- merge(data_x, data_y_nonredundant, by=c("identifier"))
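If the cleanup has to happen after the merge, as the question describes, one possible sketch is a small helper that drops a .y column only when it is identical to some .x column. The name drop_matching_y is hypothetical (it is not part of base R), and the sample data is rebuilt by hand with assumed character columns. Because it still compares values, it would also drop a .y column that only coincidentally equals an .x column, for example two columns that are all zeros for now.

```r
# Hypothetical helper: drop each ".y" column that is identical to some ".x" column.
drop_matching_y <- function(df) {
  x_cols <- grep("\\.x$", names(df), value = TRUE)
  y_cols <- grep("\\.y$", names(df), value = TRUE)

  # For every .y column, check whether any .x column holds exactly the same values.
  redundant <- vapply(
    y_cols,
    function(y) {
      any(vapply(x_cols, function(x) identical(df[[x]], df[[y]]), logical(1)))
    },
    logical(1)
  )

  # Keep everything except the redundant .y columns.
  keep <- setdiff(names(df), y_cols[redundant])
  df[, keep, drop = FALSE]
}

# Merged result from the question, rebuilt by hand (column types are assumed).
merged <- data.frame(
  date  = c("20130514", "20131204", "20130516", "20130628"),
  pgm   = c("na", "z2", "a01", "z1"),
  in.x  = c("12", "03", "04", "05"),
  logs  = c("j1", "j1", "j0", "j2"),
  out.y = c("12", "03", "04", "05"),
  stringsAsFactors = FALSE
)

cleaned <- drop_matching_y(merged)
names(cleaned)
```

identical() is strict about types, so an integer column and a double column with the same numbers would not count as a match; whether that strictness is wanted depends on the data.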
Alex A.
  • 1
    Thanks a lot!!! This is a good solution, but it only works when you don't have columns that are different but for the time being have all zero values. – Chayma Atallah Jun 01 '16 at 09:22
  • You are right! If all x-columns should always be kept, even if they are duplicates, it doesn't work. I have added another approach that handles this case. – Alex A. Jun 01 '16 at 10:09
0

I have added one more variable, new.y, to your data frame; it is a replica of out.y.

x <- data.frame(date  = c("20130514","20131204","20130516","20130628"),
  pgm = c(NA, "z2", "a01", "z1"), in.x= c(12, 3, 4, 5), out.y= c(12, 3, 4, 5),new.y = c(12, 3, 4, 5))

# Escape the dot and anchor at the end so only names ending in ".x" or ".y" match
y <- x[grepl("\\.(x|y)$", colnames(x))]

 in.x out.y new.y
1   12    12    12
2    3     3     3
3    4     4     4
4    5     5     5

y$in.x==y[,c("out.y","new.y")]
     out.y new.y
[1,]  TRUE  TRUE
[2,]  TRUE  TRUE
[3,]  TRUE  TRUE
[4,]  TRUE  TRUE

x <- x[,1:3]

      date  pgm in.x
1 20130514 <NA>   12
2 20131204   z2    3
3 20130516  a01    4
4 20130628   z1    5
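The last step above hard-codes the column positions to keep. A hedged sketch of how the comparison could pick the columns automatically follows; it assumes, as in this answer, that in.x is the reference column, and the names y_cols, cmp, to_drop and x_clean are illustrative only.

```r
# Same example data as above.
x <- data.frame(
  date  = c("20130514", "20131204", "20130516", "20130628"),
  pgm   = c(NA, "z2", "a01", "z1"),
  in.x  = c(12, 3, 4, 5),
  out.y = c(12, 3, 4, 5),
  new.y = c(12, 3, 4, 5)
)

# All columns whose name ends in ".y".
y_cols <- grep("\\.y$", names(x), value = TRUE)

# Logical matrix: one column per .y column, TRUE where it equals in.x.
cmp <- x$in.x == x[, y_cols, drop = FALSE]

# A .y column is redundant when every row matches in.x.
to_drop <- colnames(cmp)[apply(cmp, 2, all)]

# Drop the redundant columns by name instead of by position.
x_clean <- x[, !(names(x) %in% to_drop), drop = FALSE]
names(x_clean)
```

Note that if the compared columns contained NA values, all() would return NA rather than TRUE, so a real data set may need explicit NA handling.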
Arun kumar mahesh
-4

These answers from our colleagues are undoubtedly right, but a simpler way is:

dataframe[,5]  <- NULL
adrian1121
  • 2
    I doubt that this is what they're looking for – talat Jun 01 '16 at 09:18
  • @docendodiscimus it clearly does not detect the columns automatically, but it may do the trick if you are lazy (like me) and don't want to write too many lines – adrian1121 Jun 01 '16 at 09:19
  • 2
    I don't agree on the comment for lazy people. Because you'll have to figure out each column _manually_ in this approach (it's not just about 1 column) whereas a lazy person would write code to "automate" this process – talat Jun 01 '16 at 09:21
  • @docendodiscimus that's true, I buy the argument – adrian1121 Jun 01 '16 at 09:22
  • Nope, like @docendodiscimus said, I don't know which column it will be, so I need an automated solution, and I am a beginner in R. – Chayma Atallah Jul 06 '16 at 08:16