3

I have a df with dimension 58000*900 which contains replicates in row values, I want to traverse through every row and remove them. An example will make it more clear.

df
IDs Name    col1    col2    col3
123 AB.C    1.3,1.3,1.3,1.3,1.3 0,0,0,0,0   5,5,5,5,5
234 CD-E    2,2,2,2,2   0.3,0.3,0.3,0.3,0.3 1,1,1,1,1
568 GHJ 123456      123456              123456
345 FGH 9,9,9,9,9   54,54,54,54,54  0,0,0,0,0

Apparently every value is replicated 5 times and in some cases their is a problem that there is no . or , separating the values. What I want is drop those lines which does not contain either . or , and for the rest remove the duplicate values. So, the output will be:

IDs Name    col1    col2    col3
123 AB.C    1.3 0   5
234 CD-E    2   0.3 1
345 FGH 9   54  0

dput(df)
structure(list(IDs = c(123L, 234L, 568L, 345L), Name = structure(c(1L, 
2L, 4L, 3L), .Label = c("ABC", "CDE", "FGH", "GHJ"), class = "factor"), 
    col1 = structure(c(2L, 3L, 1L, 4L), .Label = c("123456", 
    "1.3,1.3,1.3,1.3,1.3", "2,2,2,2,2", "9,9,9,9,9"), class = "factor"), 
    col2 = structure(1:4, .Label = c("0,0,0,0,0", "0.3,0.3,0.3,0.3,0.3", 
    "123456", "54,54,54,54,54"), class = "factor"), col3 = structure(c(4L, 
    2L, 3L, 1L), .Label = c("0,0,0,0,0", "1,1,1,1,1", "123456", 
    "5,5,5,5,5"), class = "factor")), .Names = c("IDs", "Name", 
"col1", "col2", "col3"), class = "data.frame", row.names = c(NA, 
-4L))
Newbie
  • 411
  • 5
  • 18

3 Answers3

4

First, we restructure your data in a long format using gather(), then we filter() for value with no , using grepl(). We then split the string in value into a list using strsplit() and make each element of the list it's own row using unnest(). We remove duplicated rows using distinct() and spread() back the key and values to columns.

library(dplyr)
library(tidyr)

df %>%
  gather(key, value, -(IDs:Name)) %>%
  filter(grepl(",", value)) %>%
  mutate(value = strsplit(value, ",")) %>%
  unnest(value) %>%
  distinct %>%
  spread(key, value)

Which gives:

#Source: local data frame [3 x 5]
#
#    IDs   Name  col1  col2  col3
#  (int) (fctr) (chr) (chr) (chr)
#1   123   AB.C   1.3     0     5
#2   234   CD-E     2   0.3     1
#3   345    FGH     9    54     0

Another idea would be to use cSplit from splitstackshape:

df %>%
  cSplit(., c("col1", "col2", "col3"), direction = "long", sep = ",") %>%
  group_by(Name) %>%
  filter(!any(is.na(.))) %>%
  distinct

Which gives:

#Source: local data table [3 x 5]
#Groups: Name
#
#    IDs   Name  col1  col2  col3
#  (int) (fctr) (dbl) (dbl) (int)
#1   123   AB.C   1.3   0.0     5
#2   234   CD-E   2.0   0.3     1
#3   345    FGH   9.0  54.0     0
Steven Beaupré
  • 21,343
  • 7
  • 57
  • 77
  • It works, but gives the `Warning message: attributes are not identical across measure variables; they will be dropped`. I searched this error (http://stackoverflow.com/questions/28972386/retain-attributes-when-using-gather-from-tidyr-attributes-are-not-identical) but could not understand the reason !! – Newbie May 26 '16 at 13:35
  • You can safely ignore this warning, it just coerces each factor to character and drops their attributes when creating the `value` column in the result. – Steven Beaupré May 26 '16 at 13:38
1

Here is a base R method that works for your sample data:

df <- read.table(header=T, text="IDs Name    col1    col2    col3
 123 ABC 1.3,1.3,1.3,1.3,1.3 0,0,0,0,0   5,5,5,5,5
                  234 CDE 2,2,2,2,2   0.3,0.3,0.3,0.3,0.3 1,1,1,1,1
                  568 GHJ 123456      123456              123456
                  345 FGH 9,9,9,9,9   54,54,54,54,54  0,0,0,0,0")

# drop rows with no comma or dot
df <- df[-grep("[,.]", df$col1, invert=T),]

df[,grep("^col", names(df))] <- sapply(df[,grep("^col", names(df))], 
                                       function(i) gsub("^([0-9.]+),.*", "\\1", i))

This returns

  IDs Name   col1   col2   col3
1 123  ABC    1.3      0      5
2 234  CDE      2    0.3      1
3 568  GHJ 123456 123456 123456
4 345  FGH      9     54      0

We use regular expression functions grep and gsub to select the correct columns and to remove the part of each string that comes after a comma.

lmo
  • 37,904
  • 9
  • 56
  • 69
1

The long apply way in base R:

as.data.frame( apply( df, c(1,2), gsub, pattern="(\\d*[.]*\\d*),.*", replacement="\\1") )

Which gives:

  IDs Name   col1   col2   col3
1 123  ABC    1.3      0      5
2 234  CDE      2    0.3      1
3 568  GHJ 123456 123456 123456
4 345  FGH      9     54      0

The idea is to keep only the first element before the first comma for each item

Drawback (?) it keeps the rows with no decimal values.

Tensibai
  • 15,557
  • 1
  • 37
  • 57