I have a .txt file that contains multiple newspaper articles. Each article has a headline, the author's name, etc. I want to read the whole .txt file into R and remove every line that starts with certain words, together with the 5 lines that follow it. I think gsub with a regular expression might be the solution, but I don't know how to write it so that the next 5 lines are deleted along with the matching line.

Edit:

The .txt file consists of 200 Washington Post articles. Each article ends with a footer like the following, which is followed directly by the header of the next article:

lydia.depillis@washpost.com

LOAD-DATE: July 14, 2013

LANGUAGE: ENGLISH

PUBLICATION-TYPE: Web Publication


Copyright 2013 Washingtonpost.Newsweek Interactive Company, LLC d/b/a Washington
                                  Post Digital
                              All Rights Reserved

4 of 200 DOCUMENTS

Washington Post Blogs

In the Loop

June 28, 2013 Friday 3:08 PM EST

Whenever a line with an e-mail address appears, I want to delete everything from that line up to the line where the next article's date appears, so that there is a smooth transition to the next article. I plan to run a sentiment analysis on the text, so I don't need these lines.
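The requirement above can be sketched in base R as follows. This is only a minimal sketch under stated assumptions: the function name `strip_article_footers` is hypothetical, a footer is assumed to start on a line that consists of nothing but an e-mail address, and the next article is assumed to begin its date line with "Month D, YYYY" (so `LOAD-DATE: July 14, 2013` does not count as a date line). It also assumes the date line itself should be kept.

```r
# Hypothetical helper: drop every block that starts at an e-mail line and
# ends just before the next date line (the date line itself is kept).
strip_article_footers <- function(lines) {
  # Assumption: a footer starts on a line that is only an e-mail address.
  email_re <- "^[[:space:]]*[[:alnum:]._%+-]+@[[:alnum:].-]+\\.[[:alpha:]]{2,}[[:space:]]*$"
  # Assumption: the date line begins with "Month D, YYYY".
  date_re <- paste0("^[[:space:]]*(", paste(month.name, collapse = "|"),
                    ") [0-9]{1,2}, [0-9]{4}")
  keep <- rep(TRUE, length(lines))
  in_footer <- FALSE
  for (i in seq_along(lines)) {
    if (!in_footer && grepl(email_re, lines[i])) {
      in_footer <- TRUE          # footer starts here
    } else if (in_footer && grepl(date_re, lines[i])) {
      in_footer <- FALSE         # next article's date line: keep from here on
    }
    keep[i] <- !in_footer
  }
  lines[keep]
}

# Small sample; for the real file use something like
# lines <- readLines("articles.txt")   # file name is a placeholder
lines <- c("Last sentence of article 3.",
           "lydia.depillis@washpost.com", "",
           "LOAD-DATE: July 14, 2013", "",
           "4 of 200 DOCUMENTS", "",
           "June 28, 2013 Friday 3:08 PM EST", "",
           "First sentence of article 4.")
cleaned <- strip_article_footers(lines)
print(cleaned)
```

If the last footer in the file has no date line after it, this sketch drops everything from the e-mail line to the end of the file. The patterns would need adjusting if some articles end without an e-mail address or use a different date format.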

Werner Hertzog
  • Can you share your attempted code so far? – Thomas Guillerme Feb 27 '18 at 02:31
  • Please learn [how to ask](https://stackoverflow.com/help/how-to-ask) good questions, and then provide a [minimal reproducible example/attempt](https://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example), including sample data. It's difficult to provide help without any concrete information; for example: *"that starts with certain words"*. Which words? How are we to know? – Maurits Evers Feb 27 '18 at 02:32
  • Roughly `l <- readLines("some_file.txt"); l2 <- l[-sapply(grep("^some_word", l), function(x) seq(x, x + 5))]` or do it from the command line – alistaire Feb 27 '18 at 02:46
  • Can you give examples of data and the expected output? – Onyambu Feb 27 '18 at 04:01
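
The rough one-liner in the comments above covers the original "matching line plus the next 5 lines" version of the question. A slightly more defensive sketch of the same idea is shown below. The function name `drop_with_following` and the sample pattern are hypothetical. The main difference is that it returns the input unchanged when nothing matches, whereas negating the result of `sapply()` over an empty `grep()` result appears to raise an error.

```r
# Hypothetical helper: drop each line matching `pattern` plus the `n_after`
# lines that follow it.
drop_with_following <- function(lines, pattern, n_after = 5) {
  hits <- grep(pattern, lines)
  if (length(hits) == 0) return(lines)   # nothing matched: keep all lines
  # Expand each hit into hit, hit + 1, ..., hit + n_after.
  drop <- unique(unlist(lapply(hits, function(i) seq(i, i + n_after))))
  drop <- drop[drop <= length(lines)]    # ignore indices past the end
  lines[-drop]
}

# Small sample; the pattern "^LOAD-DATE" is only an illustration.
x <- c("a", "LOAD-DATE: July 14, 2013", "1", "2", "3", "4", "5", "b")
result <- drop_with_following(x, "^LOAD-DATE")
print(result)
```

A fixed window of 5 lines only works if every footer has the same length. For the Washington Post file described in the edit, the distance between the e-mail line and the next date line may vary, so matching on both ends is likely the safer choice.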
