
Using R, I would like to remove all lines from a file that start with a certain pattern. Since the file can be huge, I would rather not read the whole file, remove all matching lines, and then write the whole file back. I am thus wondering whether I can have both a read and a write connection to the same file (both open all the time, or one at a time?). The following shows the idea (but it 'hangs' and thus fails).

## Create an example file
fnm <- "foo.txt" # file name
sink(fnm)
cat("Hello\n## ----\nworld\n")
sink()

## Read the file 'fnm' one line at a time and write it back to 'fnm'
## if it does *not* contain the pattern 'pat'
pat <- "## ----" # pattern
while(TRUE) {
    rcon <- file(fnm, "r") # read connection
    line <- readLines(rcon, n = 1) # read one line
    close(rcon)
    if(length(line) == 0) { # end of file
        break
    } else {
        if(!grepl(pat, line)) {
            wcon <- file(fnm, "w")
            writeLines(line, con = wcon)
            close(wcon)
        }
    }
}

Note:

1) See here for an answer if one writes to a new file. One could then delete the old file and rename the new one to the old one, but that does not seem very elegant :-).

2) Update: The following MWE produces

Hello
world
-
world

See:

## Create an example file
fnm <- "foo.txt" # file name
sink(fnm)
cat("Hello\n## ----\nworld\n")
sink()

## Read the file 'fnm' one line at a time and write it back to 'fnm'
## if it does *not* contain the pattern 'pat'
pat <- "## ----" # pattern
con <- file(fnm, "r+") # read and write connection
while(TRUE) {
    line <- readLines(con, n = 1L) # read one line
    if(length(line) == 0) break # end of file
    if(!grepl(pat, line))
        writeLines(line, con = con)
}
close(con)
Marius Hofert
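
For reference, the approach from Note 1 (write the kept lines to a new file, then rename it over the old one) can be sketched as follows. This is only a minimal sketch under stated assumptions: the helper name `remove_lines_starting_with`, its `chunk` argument, and the use of `startsWith()` for a fixed prefix are illustrative choices, not something taken from the linked answer.

```r
## Hypothetical helper (not from the thread): drop all lines of 'path' that
## start with the fixed string 'prefix', reading 'chunk' lines at a time.
remove_lines_starting_with <- function(path, prefix, chunk = 10000L) {
    ## Temporary file in the same directory, so the rename stays on one file system
    tmp <- tempfile(tmpdir = dirname(path))
    rcon <- file(path, "r") # read connection to the original file
    wcon <- file(tmp, "w")  # write connection to the temporary file
    tryCatch({
        repeat {
            lines <- readLines(rcon, n = chunk) # read the next chunk
            if (length(lines) == 0L) break      # end of file
            writeLines(lines[!startsWith(lines, prefix)], wcon) # keep non-matching lines
        }
    }, finally = {
        close(rcon) # both connections must be closed before renaming
        close(wcon)
    })
    if (!file.rename(tmp, path)) stop("could not replace ", path)
    invisible(path)
}

## Usage on a small example file
fnm <- tempfile(fileext = ".txt")
writeLines(c("Hello", "## ----", "world"), fnm)
remove_lines_starting_with(fnm, "## ----")
readLines(fnm)
```

Only one chunk of lines is held in memory at a time, at the cost of temporarily needing disk space for a second copy of the kept lines.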

2 Answers


I think you just need open = 'r+'. From ?file:

Modes

"r+", "r+b" -- Open for reading and writing.

I don't have your sample file, so here is a minimal example instead: take a file with a-z on 26 lines and replace the letters one by one with A-Z:

tmp = tempfile()
writeLines(letters, tmp)
f = file(tmp, 'r+')
while (TRUE) {
  l = readLines(f, n = 1L)
  if (!length(l)) break
  writeLines(LETTERS[match(l, letters)], f)
}
close(f)

readLines(tmp) afterwards confirms this worked.

MichaelChirico
  • Hi, thanks. My file was provided (sink()). If I adapt your idea to my MWE, I obtain the above 'Update'. Why are there two additional lines written to the file? ("-" and "world"). – Marius Hofert Feb 21 '19 at 02:59
  • ... or, to use your example, if you add `if(startsWith(l, "e"))` before the last `writeLines` command, have a look at the file afterwards to see what's happening (for that it might be easier to write to a concrete file as I did in my example). – Marius Hofert Feb 21 '19 at 22:27
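
A plausible explanation for the two extra lines, assuming the behaviour described in ?seek (R keeps separate positions for reading and writing on a file connection opened for both): each kept line is written at the write position, which starts at the beginning of the file and only advances when something is written. Skipped lines are therefore overwritten by later lines, but the file never shrinks, so the old tail stays in place. The sketch below does not touch a real connection; it only simulates that assumed behaviour on a raw vector, and the variable names `bytes` and `wpos` are illustrative.

```r
## Simulate an "r+" connection with independent read and write offsets
## (an assumption based on ?seek; no real file is involved).
bytes <- charToRaw("Hello\n## ----\nworld\n") # file contents as bytes
pat <- "## ----"                              # pattern
lines <- strsplit(rawToChar(bytes), "\n", fixed = TRUE)[[1]] # what the reads return
wpos <- 0L # write offset in bytes, starting at the beginning of the file
for (line in lines) {
    if (!grepl(pat, line, fixed = TRUE)) {
        out <- charToRaw(paste0(line, "\n"))
        bytes[wpos + seq_along(out)] <- out # overwrite in place; the length is unchanged
        wpos <- wpos + length(out)          # advance the write offset only
    }
}
cat(rawToChar(bytes))                # Hello, world, -, world (as in the 'Update')
cat(rawToChar(bytes[seq_len(wpos)])) # only the first 'wpos' bytes are the wanted content
```

Under this reading, the a-z example works because every replacement line has exactly the same length as the line it replaces, so nothing is left over. An in-place deletion would additionally have to cut the file off at the final write position; base R has truncate() for file connections, but whether it does the right thing in this situation is not verified here.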

I understand you want to use R, but in case you're not aware, there are some really simple scripting tools that excel at this type of task. E.g., gawk is designed for pretty much exactly this type of operation and is simple enough to learn that you could write a script for this within minutes, even without any prior knowledge.

Here's a one-liner to do this in gawk (the -i inplace option needs GNU awk 4.1 or later; the plain awk on other Unix systems may not support it):

gawk -i inplace '!/^pat/ {print}' foo.txt

Of course, it is trivial to do this from within R using

system(paste0("gawk -i inplace '!/^", pat, "/ {print}' ", fnm))
dww
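
If the file name or the pattern contains spaces, pasting them into a single command string can break on shell quoting. A slightly more defensive sketch uses system2() with shQuote(); it assumes a Unix-like shell, a gawk that supports -i inplace, and a pattern without regex metacharacters, slashes, or single quotes. The `Sys.which()` guard and the variable names are illustrative additions.

```r
## Example file in a temporary location
fnm <- tempfile(fileext = ".txt")
writeLines(c("Hello", "## ----", "world"), fnm)

pat <- "## ----"                 # assumed free of regex metacharacters, '/' and quotes
prog <- paste0("!/^", pat, "/")  # awk program: print lines that do not start with 'pat'
args <- c("-i", "inplace", shQuote(prog), shQuote(fnm)) # quote for the shell

status <- NA_integer_
if (nzchar(Sys.which("gawk"))) { # only run if gawk is on the PATH
    status <- system2("gawk", args)
}
status # 0 indicates success; NA means gawk was not found
```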
  • Thanks. I was aware of that, that's why I explicitly mentioned R. – Marius Hofert Feb 21 '19 at 03:15
  • Let me be more precise: I'm interested to learn how reading/writing to/from a connection works in R. I found solutions to one of them at a time here in the forum but not in this use-case, although it is recommended (see the most upvoted answer [here](https://stackoverflow.com/questions/12626637/reading-a-text-file-in-r-line-by-line)) to not first read in the whole file and then write the whole file. – Marius Hofert Feb 21 '19 at 12:57