14

I am trying to figure out the best approach in R to remove rows that contain a specific string, in my case 'no_data'.

I have data from an outside source that imputes na's with 'no_data'

an example is this:

 time  |speed  |wheels
1:00   |30     |no_data
2:00   |no_data|18
no_data|no_data|no_data
3:00   |50     |18

I want to go through the data and remove each row containing this 'no_data' string in any column. I have had a lot of trouble figuring this out. I have tried an sapply, filter, grep and combinations of the three. I am by no means an r expert so it could just be me incorrectly using these. Any help would be appreciated.

Sotos
  • 51,121
  • 6
  • 32
  • 66
lentz
  • 143
  • 1
  • 1
  • 7
  • related https://stackoverflow.com/questions/22850026/filter-rows-which-contain-a-certain-string/49858451#49858451 – tjebo Apr 01 '20 at 14:06
  • 1
    Does this answer your question? [Filter rows which contain a certain string](https://stackoverflow.com/questions/22850026/filter-rows-which-contain-a-certain-string) – tjebo Apr 01 '20 at 14:07

4 Answers4

11

We can use rowSums to create a logical vector and subset based on it

df1[rowSums(df1 == "no_data")==0, , drop = FALSE]
#   time speed wheels
#4 3:00    50     18

data

df1 <- structure(list(time = c("1:00", "2:00", "no_data", "3:00"), speed = c("30", 
"no_data", "no_data", "50"), wheels = c("no_data", "18", "no_data", 
"18")), .Names = c("time", "speed", "wheels"), class = "data.frame", 
row.names = c(NA, -4L))
akrun
  • 874,273
  • 37
  • 540
  • 662
  • I m getting some weird results when I do `df1 == 'no_data'`. Did you read the data frame from OP or did you create it yourself? – Sotos Jun 14 '17 at 12:34
  • 1
    @Sotos Updated the data I used – akrun Jun 14 '17 at 12:35
  • When I try this method, i get a "string is not in a standard unambiguous format" I assume this is because I have a mix of char,int,and num variables in the real data set. I failed to express that in my example... – lentz Jun 14 '17 at 12:39
  • 1
    @lentz Could you please check the `str(yourdata)` Here I have `character` class. If it is factor, convert to `character` – akrun Jun 14 '17 at 12:40
  • 1
    I believe the problem was due to a few `POSIXct` formats I had in the data. converted to `character` – lentz Jun 14 '17 at 12:45
9

edit update to the filter(if_all/if_any) syntax (dplyr vs. 1.0.10), formerly using across (now deprecated) and even before that filter_all or filter_any (superseded).

Here a dplyr option: (using Akrun's data)

library(dplyr)

df1 <- structure(list(time = c("1:00", "2:00", "no_data", "3:00"), speed = c("30", "no_data", "no_data", "50"), wheels = c("no_data", "18", "no_data", "18")), .Names = c("time", "speed", "wheels"), class = "data.frame", row.names = c(NA, -4L))

## with if_any
df1 %>% filter(if_any(everything(), ~ grepl("no_data", .)))
#>      time   speed  wheels
#> 1    1:00      30 no_data
#> 2    2:00 no_data      18
#> 3 no_data no_data no_data

## or with if_all
df1 %>% filter(if_all(everything(), ~ !grepl("no_data", .)))
#>   time speed wheels
#> 1 3:00    50     18

## to GET all rows that fulfil condition, use
df1 %>% filter(if_any(everything(), ~ grepl("no_data", .)))
#>      time   speed  wheels
#> 1    1:00      30 no_data
#> 2    2:00 no_data      18
#> 3 no_data no_data no_data
tjebo
  • 21,977
  • 7
  • 58
  • 94
8

You can read the data using na.strings = 'no_data' to set them as NA and then simply omit NAs (or take complete.cases), i.e. (Using @akrun's data set)

d1 <- read.table(text = 'time   speed  wheels
 1    1:00      30 no_data
            2    2:00 no_data      18
            3 no_data no_data no_data
            4    3:00      50      18', na.strings = 'no_data', h=TRUE)

d1[complete.cases(d1),]
#  time speed wheels
#4 3:00    50     18

#OR

na.omit(d1)
#  time speed wheels
#4 3:00    50     18
Sotos
  • 51,121
  • 6
  • 32
  • 66
  • Thank you @akrun. I couldn't get it directly from OP data as some elements were 'no_data', others ' no_data', and others remain a mystery :) – Sotos Jun 14 '17 at 12:45
3

akrun answer is quick, correct and simply as much is it can :) however if you like to make your life more complex you can also do:

dat
     time   speed  wheels
1    1:00      30 no_data
2    2:00 no_data      18
3 no_data no_data no_data
4    3:00      50      18

dat$new <- apply(dat[,1:3], 1, function(x) any(x %in% c("no_data")))
dat <- dat[!(dat$new==TRUE),]
dat$new <- NULL

dat
  time speed wheels
4 3:00    50     18
Adamm
  • 2,150
  • 22
  • 30