
I am trying to manipulate text in R. I am loading Word documents and want to preprocess them so that all text up to a certain point is deleted.

library(readtext)

#List all documents
file_list = list.files()

#Read Texts and write them to a data table
data = readtext(file_list)

# Create a corpus
library(tm)
corp = VCorpus(VectorSource(data$text))

#Remove all stopwords and punctuation
corp = tm_map(corp, removeWords, stopwords("english"))
corp = tm_map(corp, removePunctuation)

What I am now trying to do is, for each document in the corpus, delete all text before a certain keyword (here "Disclosure") and delete everything after the word "Conclusion".

Phil
Fredyonge
  • Could you share an example text so it is easier for folks to help you? (https://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example) Have you seen this? https://datascience.stackexchange.com/questions/8922/removing-strings-after-a-certain-character-in-a-given-text – Skaqqs Jul 04 '21 at 18:47
  • `stringr::str_replace_all(somevar, "^.*(?=Disclosure)", "")` and `stringr::str_replace_all(somevar, "(?<=Conclusion).*$", "")`? Please define your question a little more precisely, and provide an MRE if possible. – Dunois Jul 04 '21 at 18:48

1 Answer


There are many ways to do what you want, but without knowing more about your case or seeing an example text, it is difficult to come up with the right solution.

If you are SURE that each document contains only one instance of "Disclosure" and one instance of "Conclusion", you can use the following. Be warned: this assumes that each document's content is a single character string (a vector of length one) and will not work otherwise. It will be relatively slow, but for a few small to medium-sized documents it will work fine.

All I did was write some functions that apply a regex to the content of each document in a corpus. You could also do this with an apply-style call (for example lapply) instead of tm_map.

# Fake example texts standing in for the documents read with readtext
data = c("My fake text Disclosure This is just a sentence Conclusion Don't consider it a file.",
         "My second fake Disclosure This is just a sentence Conclusion Don't consider it a file.")

# Create a corpus
library(tm)
library(stringr)
corp = VCorpus(VectorSource(data))

# Remove all stopwords and punctuation
corp = tm_map(corp, removeWords, stopwords("english"))
corp = tm_map(corp, removePunctuation)

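# Delete everything before "Disclosure".
# (?s) lets "." also match newlines, so multi-line documents are handled.
remove_before_Disclosure <- function(doc.in){
  doc.in$content <- str_remove(doc.in$content, "(?s).+(?=Disclosure)")
  return(doc.in)
}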

corp2 <- tm_map(corp,remove_before_Disclosure)

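# Delete everything after "Conclusion".
# (?s) lets "." also match newlines, so multi-line documents are handled.
remove_after_Conclusion <- function(doc.in){
  doc.in$content <- str_remove(doc.in$content, "(?s)(?<=Conclusion).+")
  return(doc.in)
}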

corp2 <- tm_map(corp2,remove_after_Conclusion)
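
As an alternative to the tm_map approach above, here is a minimal sketch that trims the raw character vector before the corpus is built. The helper name trim_to_section and its start/end arguments are hypothetical (they are not part of tm or stringr), and the second test string is made up to show a multi-line document. Because stringr functions are vectorised, no explicit apply call is needed. Note that this version keeps the text from the first "Disclosure" through the first "Conclusion".

```r
library(stringr)

# Hypothetical helper: keep only the text from the first occurrence of
# `start` through the first occurrence of `end`.
# Assumes `start` and `end` are plain words without regex metacharacters.
trim_to_section <- function(x, start = "Disclosure", end = "Conclusion") {
  # (?s) lets "." match newlines; the lazy ".*?" stops at the first `start`
  x <- str_remove(x, paste0("(?s)^.*?(?=", start, ")"))
  # Drop everything that follows the first `end`
  x <- str_remove(x, paste0("(?s)(?<=", end, ").*$"))
  x
}

# Made-up example texts; the second one spans several lines
data <- c("My fake text Disclosure This is just a sentence Conclusion Don't consider it a file.",
          "Line one\nDisclosure body text\nConclusion trailing\nmore")

trimmed <- trim_to_section(data)
print(trimmed)
```

The trimmed vector can then be passed to VCorpus(VectorSource(trimmed)) and cleaned with removeWords and removePunctuation as before. Strings that do not contain a keyword are left unchanged by the corresponding step.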
Adam Sampson