I have a long datafile of the kind:

# Comment line 1
# Comment line 2
# ... many more lines
# values intensities
5.556667e+00    4.008450e+02
5.581000e+00    4.008770e+02
... many more values
# End comments

I would like to create a function which, applied to this file, would return the column names from the last comment line before the data:

[1] "values" "intensities"

What would you advise me to do?

leparc
  • Do you know how many comment lines there are? If yes, you could just specify to skip them when reading in the file and specify the sep argument to be white spaces (see here https://stackoverflow.com/questions/39110755/skip-specific-rows-using-read-csv-in-r) – Alex May 09 '21 at 08:51
  • No unfortunately it can vary – leparc May 09 '21 at 08:59

2 Answers

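readLines can read the file in, and grep then finds the lines that start with the comment character. In the function below, the comment character defaults to the question's "#". The header is taken to be the last comment line before the first gap in the comment line numbers, so this relies on at least one comment line following the data, like the question's "# End comments".

fun <- function(file, char = "#"){
  x <- readLines(con = file)
  # line numbers of the lines starting with the comment character
  i <- grep(paste0("^", char), x)
  # last comment line before the first gap, i.e. the header line
  y <- x[i[which(diff(i) != 1)[1]]]
  # split on runs of white space and drop the leading comment character
  unlist(strsplit(y, "\\s+"))[-1]
}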

fun("filename.txt")
#[1] "values"      "intensities"
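
The diff-based lookup above needs a comment line after the data; if every comment line is consecutive, there is no gap to find and nothing is returned. A minimal sketch of an alternative, assuming the header is the last comment line before the first non-empty, non-comment line (`header_names` is a hypothetical name used only for illustration):

```r
# Sketch: header = last comment line before the first data line.
# `header_names` is a hypothetical name, not part of the answer above.
header_names <- function(file, char = "#") {
  x <- readLines(con = file)
  is_comment <- startsWith(x, char)
  # index of the first non-empty line that is not a comment
  first_data <- which(!is_comment & nzchar(trimws(x)))[1]
  if (is.na(first_data)) stop("no data line found")
  # comment lines located before the first data line
  before <- which(is_comment[seq_len(first_data - 1)])
  if (length(before) == 0) stop("no comment line before the data")
  header <- x[max(before)]
  # remove the leading comment character, then split on runs of white space
  header <- trimws(substring(header, nchar(char) + 1))
  strsplit(header, "\\s+")[[1]]
}
```

Called as `header_names("filename.txt")`, this should work whether or not the file ends with a comment line, since it never looks past the first data line.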

If the datafile is too long to fit in memory and awk is available, the following solution finds the header line without reading the whole file into R.

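read_awk <- function(file, char = "#"){
  cmd <- "awk"
  # awk pattern: the first line that does not start with the comment character
  pattern <- paste0("/^[^", char, "]/")
  # print the number of the line just before it (the header line), then stop
  awkcmd <- paste0("'", pattern, " {print NR - 1; exit 0}'")
  # quote the file name so that paths containing spaces survive the shell
  args <- c(awkcmd, shQuote(file))
  out <- system2(command = cmd, args = args, stdout = TRUE)
  as.integer(out)
}
fun_awk <- function(file, char = "#"){
  n <- read_awk(file, char = char)
  # skip the first n - 1 lines and read only the header line
  x <- scan(file = file, what = character(), sep = "\n", skip = n - 1, nlines = 1)
  # split on runs of white space and drop the leading comment character
  unlist(strsplit(x, "\\s+"))[-1]
}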

fun_awk("filename.txt")
#Read 1 item
#[1] "values"      "intensities"
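
If awk is not available, a similar effect can probably be had in plain R by reading from a connection one line at a time and stopping at the first data line. This is only a sketch: `fun_stream` and its `path` argument are hypothetical names, and it assumes the comment character in the header line is followed by white space, as in the question.

```r
# Sketch: read line by line and stop at the first data line, so the
# whole file is never loaded. `fun_stream` is a hypothetical name.
fun_stream <- function(path, char = "#") {
  con <- file(path, open = "r")
  on.exit(close(con))
  previous <- NA_character_
  repeat {
    line <- readLines(con, n = 1)
    if (length(line) == 0) stop("reached end of file without finding data")
    if (startsWith(line, char)) {
      # remember the most recent comment line
      previous <- line
    } else if (nzchar(line)) {
      # first non-empty, non-comment line: the data starts here
      break
    }
  }
  if (is.na(previous)) stop("no comment line before the data")
  # split on runs of white space and drop the leading comment character
  unlist(strsplit(previous, "\\s+"))[-1]
}
```

Reading one line per call is slow when there are very many comment lines, but it keeps memory use flat and stops as soon as the header is known.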

Data

"filename.txt" is the following file:

# Comment line 1
# Comment line 2
# ... many more lines
# values intensities
5.556667e+00    4.008450e+02
5.581000e+00    4.008770e+02
# End comments
Rui Barradas
  • Many thanks for your help. I have the impression the script is almost working but not in my case. which(diff(grep("#", x)) != 1) leads to integer(0) while grep("#", x) correctly provides the lines in file_lines which contain the "#" character – leparc May 09 '21 at 11:14
  • @leparc Can you run just `diff(grep("#", x))` and post the output? If it's all 1's then all lines with "#" are consecutive. – Rui Barradas May 09 '21 at 11:25

Depending on how many white spaces you have between the columns you might want to use a regular expression here:

library(dplyr)  # as_tibble() and %>%
library(tidyr)  # separate()

data <- as_tibble(read.delim('test.txt', header = FALSE))
data <- data[!startsWith(data$V1,'#'),] %>%
    separate(V1, into = c('values', 'intensities'), sep = '\\s+')
data

# A tibble: 2 x 2
  values       intensities 
  <chr>        <chr>       
1 5.556667e+00 4.008450e+02
2 5.581000e+00 4.008770e+02
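
If the column names should come from the file rather than being typed into `separate()`, base R can probably do both steps. The sketch below assumes the names sit on the last line before the first data line and that the columns are separated by white space; `read_with_header` is a hypothetical helper name.

```r
# Sketch: take the column names from the comment line just before the data,
# then let read.table() skip every comment line via `comment.char`.
# `read_with_header` is a hypothetical helper name.
read_with_header <- function(file, char = "#") {
  x <- readLines(con = file)
  is_comment <- startsWith(x, char)
  # first non-empty line that is not a comment
  first_data <- which(!is_comment & nzchar(trimws(x)))[1]
  # header line: split on white space, drop the comment character
  nms <- unlist(strsplit(x[first_data - 1], "\\s+"))[-1]
  # comment lines are dropped by read.table itself
  read.table(text = x, comment.char = char, col.names = nms)
}
```

Unlike the tibble above, the columns come back as numeric because `read.table()` converts them while parsing.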
Alex