
I have a data frame containing emails. There is a column named "message" that looks like this:

> > dataset$message[1]  
>[1] Message-ID:...
> 
> Date: ...
> 
> From: ...
> 
> To:...
> 
> Subject: ...
> 
> Mime-Version: ...
> 
> Content-Type:...
> 
> Content-Transfer-Encoding: ...
> 
> X-From:...
> 
> X-To: ...
> 
> X-cc:...
> 
> X-bcc: ...
> 
> X-Folder: ...
> 
> X-Origin: ...
> 
> X-FileName: ...
>  
> > Some message text

In other words, each entry contains 15 lines of headers and then the text. What I want is to remove these 15 lines from each row and be left only with the text, so that

>dataset$message[1]

looks like this:

> Some message text
  • Please provide a [reproducible example](http://stackoverflow.com/questions/5963269) along with the expected output. Also, don't forget to post the attempt that failed. Cheers – Sotos Nov 22 '18 at 14:11
  • Once the data is inside the data.frame it’s too late. You want to remove it *before* it gets read into the data.frame, e.g. by providing the appropriate arguments to `read.table`. – Konrad Rudolph Nov 22 '18 at 14:13

1 Answer


Something like this would work:

    sub("^(?:.*\\n){15}", "", multiline_string_mail, perl = TRUE)
    #[1] "Super secret message"

Example data (you should always provide usable example data):

    multiline_string_mail =
    "hehe
    hehe
    hehe
    hehe
    hehe
    hehe
    hehe
    hehe
    hehe
    hehe
    hehe
    hehe
    hehe
    hehe
    hehe
    Super secret message"
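
To apply the same pattern to the data frame from the question, a minimal sketch could look like the following. The `dataset` built here is a hypothetical stand-in (the `Header-n: value` lines are invented placeholders), and it assumes every message has exactly 15 header lines followed directly by the body:

```r
# Hypothetical stand-in for the question's data frame:
# each message is 15 placeholder header lines followed by the body text.
headers <- paste0("Header-", 1:15, ": value")
dataset <- data.frame(
  message = c(
    paste(c(headers, "Some message text"), collapse = "\n"),
    paste(c(headers, "Another message", "with two lines"), collapse = "\n")
  ),
  stringsAsFactors = FALSE
)

# sub() is vectorised over its third argument, so one call strips the
# first 15 lines from every element of the column.
dataset$message <- sub("^(?:.*\\n){15}", "", dataset$message, perl = TRUE)

dataset$message
```

If the real messages have an empty line between the headers and the body, a leading newline is left behind; wrapping the result in `trimws(..., which = "left")` is one way to drop it.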
Andre Elrico