2

I’m using R version 2.15.3 (2013-03-01) with RStudio 0.97.312 on Ubuntu 12.10. I’m trying to create some histograms of logger data in R. However, some sensors weren’t always working, so I got some tables with #N/A and O/C in it. Here’s an excerpt of the log:

Date    Time    Type    control.value (V)   light.barrier (V)   T hotplate ('C) T mesh ('C) T exhaust ('C)  T camera ('C)   Ref. Junction 1 ('C)

30.03.2012  13:47:50    Interval    0.001   23.556  411.0   O/C 30.5    35.1    23.14
30.03.2012  13:47:51    Interval    0.001   23.556  411.1   O/C 30.3    35.2    23.14
30.03.2012  13:47:52    Interval    0.001   23.556  411.1   O/C 30.2    35.5    23.14
30.03.2012  13:47:53    Interval    0.001   23.556  410.9   O/C 29.8    35.5    23.14
30.03.2012  13:47:54    Interval    0.001   23.556  410.9   O/C 30.1    35.3    23.14
30.03.2012  13:47:55    Interval    0.001   23.556  411.1   O/C 30.2    35.4    23.14
30.03.2012  13:47:56    Interval    0.001   23.556  410.8   O/C 29.8    35.4    23.14
30.03.2012  13:47:57    Interval    0.001   23.556  410.2   O/C 29.4    35.3    23.14
30.03.2012  13:47:58    Interval    0.001   23.556  409.5   O/C 29.1    35.0    23.14
30.03.2012  13:47:59    Interval    0.000   23.556  408.9   O/C 29.3    34.6    23.14
30.03.2012  13:48:00    Interval    0.000   23.556  408.7   O/C #N/A    #N/A    23.14

Output of dput (head(logs), file = "dput.txt"): http://pastebin.de/34176

R refuses to process the columns with #N/A and O/C. I can’t reformat it by hand, the file has 185 000 lines.

When I load the log and try to create a histogram:

> logs <- read.delim("../data/logger/logs/logs.txt", header=TRUE) 
> hist (logs$mesh)

I get this error message:

Fehler in hist.default(logs$mesh) : 'x' muss nummerisch sein

Rough translation (see: How to change the locale of R in RStudio?):

Error in hist.default(logs$mesh) : 'x' must be numeric

The only columns I can create histograms from are the numerical ones listed by sapply. So I thought I have to remove these invalid values to get numeric ones.

How can I remove the invalid rows? I’m also open to other ways than processing them with R, e.g Perl or Python if that’s more suitable for this task.

This is the output of sapply after loading the log:

> sapply (logs, is.numeric)
     date          time          type control.value light.barrier      hotplate          mesh       exhaust 
    FALSE         FALSE         FALSE          TRUE         FALSE          TRUE         FALSE         FALSE 
   camera     reference 
    FALSE          TRUE 

After replacing the #N/A and O/C with NA (https://stackoverflow.com/a/16350443/2333821)

  logs.clean <- data.frame (check.rows = TRUE, apply(logs, 2, sub, pattern = "O/C|#N/A", replacement = NA))

I get this:

> sapply (logs.clean, is.numeric)
     date          time          type control.value light.barrier      hotplate          mesh       exhaust 
    FALSE         FALSE         FALSE         FALSE         FALSE         FALSE         FALSE         FALSE 
   camera     reference
    FALSE         FALSE 
Community
  • 1
  • 1
Tobias Schula
  • 789
  • 1
  • 5
  • 6
  • What do you mean by 'processing'? can you run `dput(head(DF))` where `DF` is your data.frame and post here the result? – Ferdinand.kraft May 03 '13 at 01:11
  • If you want to drop those rows, look at `?subset` and construct a logical argument to exclude those rows, e.g., `myvar != "O/C & myothervar != "#N/A"`. – Frank May 03 '13 at 01:18
  • voted down for no code, no error message, and no details on what "R refuses to process these columns" means. Sheesh. – Spacedman May 03 '13 at 08:47
  • Reformulated my original question. Hopefully it is more precise now. I’m not a native speaker and I don’t always succeed in expressing myself understandingly. – Tobias Schula May 03 '13 at 10:18
  • Rather than `sapply and is.numeric`, what is the output of `str(logs.clean)`? – mike May 03 '13 at 16:18

4 Answers4

1

Since you've specifically asked about removing rows, here's how I'd do it, with an alternative below.

#Makes some data
df <- data.frame(A = c("O/C", "#N/A", 1:3), B = c(4:6, "O/C", "#N/A"))
     # A    B
# 1  O/C    4
# 2 #N/A    5
# 3    1    6
# 4    2  O/C
# 5    3 #N/A

#Find rows that contain either value
remove <- apply(df, 1, function(row) any(row == "O/C" | row == "#N/A"))
#Subset using the negated index
df.rows <- df[!remove,]
#   A B
# 3 1 6

Alternatively you could look for the values and set them to NA, which won't remove the rows, but will allow most functions to work with the data.

df.clean <- data.frame(apply(df, 2, sub, pattern = "O/C|#N/A", replacement = NA))

I use data.frame() to convert everything to numeric quickly, there might be a more elegant way to do that...

alexwhan
  • 15,636
  • 5
  • 52
  • 66
  • Both of your examples work, I get the rows removed or the values changed to `NA`, but after that R is complaining about non numerical values, even in the columns that worked previously. – Tobias Schula May 03 '13 at 09:30
  • I would guess you need to convert the relevant columns to numeric (`as.numeric()`), but without your data there's no way to know for sure – alexwhan May 03 '13 at 09:37
  • Am I doing it right: `as.numeric (logs$hotplate)`? After that I still get the non numeric error. How much data would be useful? The file has a lot of lines. – Tobias Schula May 03 '13 at 10:30
1

Since you wrote that you are open to other ways than processing them with R ....

In a regular terminal window (not in an R-console):

grep -v  '#N/A' log.txt > cleaned.txt

The option -v inverts the matching, outputing all lines that do not match.

To grab all the lines that have neither #N/A and O/C:

grep -v '#N/A\|O/C' log.txt > cleaned.txt
mike
  • 221
  • 1
  • 7
0

read.table helpfully removes anything after the comment character.

Comment characters are defined by the comment.char parameter.

See help(read.table).

Of course, I can only guess that you are using read.table, because you don't give us sample code or error messages or nothing.

Spacedman
  • 92,590
  • 12
  • 140
  • 224
  • I’m using `read.delim`, because the tables are tab-separated. Is it useful to upload the log somewhere? It has approximately 185 000 lines and a size of 20 MiB. About the error massages: http://stackoverflow.com/questions/16347731/how-to-change-the-locale-of-r-in-rstudio – Tobias Schula May 03 '13 at 09:07
  • If you want to make a shorter version of a CSV file for test purposes you can use `head -50 bigfile.csv > smallfile.csv` on the command line. Of course, make sure you have enough in the small file to demonstrate the failing behaviour! – Spacedman May 03 '13 at 10:58
0

This is an old post, but since I stumbled here, I would do this to remove rows:

df <- data.frame(A = c("O/C", "#N/A", 1:3), B = c(4:6, "O/C", "#N/A"))
#      A    B
# 1  O/C    4
# 2 #N/A    5
# 3    1    6
# 4    2  O/C
# 5    3 #N/A
cleandf <- df[!df$A %in% c("O/C", "#N/A") & !df$B %in% c("O/C", "#N/A"),]

one liner that uses conditional selection of data frames

UVRadiation
  • 73
  • 1
  • 5