I want to extract dates from txt (or HTML) documents in R with the tm package, using a pattern that I identified in the text. I have newspaper articles on my PC in the folders data_X_txt (plain text) and data_X (HTML). Each folder contains one document per company, named after that company, which holds all of its newspaper articles in a single txt or HTML file. I downloaded these documents as HTML from Lexis Nexis.
For each document I want to know the upload dates of the articles it contains. I found that the upload date of each article follows the word UPDATE:.
I found this question, which is similar to my problem: "Extract unknown words from a recurrent pattern".
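To illustrate what I am after, here is a minimal sketch in base R. The sample lines and the date format are made up for illustration (the real files may format the date differently), and the name upload_dates is just a placeholder.

```r
# Made-up sample lines; the real articles may format the date differently.
txt <- c("Some article text.",
         "UPDATE: March 25, 2008",
         "More article text.",
         "UPDATE: April 1, 2008")

# Keep only the lines that contain the literal marker "UPDATE:".
hits <- grep("UPDATE:", txt, fixed = TRUE, value = TRUE)

# Remove everything up to and including "UPDATE:" and any whitespace after it.
upload_dates <- trimws(sub("^.*UPDATE:[[:space:]]*", "", hits))
upload_dates
```

The result would be one character string per article, which could then be converted with as.Date once the actual date format is known.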
But I have several problems getting to the solution.
First, I don't know how to correctly load the data from the individual documents into R for further processing with a regex.
Secondly, I have problems understanding and applying sub myself. This is the call I found:
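What I have in mind for loading is roughly the following sketch, which uses only base R. The helper name read_folder is hypothetical, and I am assuming the files end in .txt and can be read line by line with the default encoding.

```r
# Hypothetical helper: read every .txt file in a folder into a named list,
# with one character vector of lines per document.
read_folder <- function(dir) {
  files <- list.files(dir, pattern = "\\.txt$", full.names = TRUE)
  docs <- lapply(files, readLines, warn = FALSE)
  names(docs) <- basename(files)
  docs
}

# Folder name from my setup; only run if the folder exists in the working directory.
if (dir.exists("data_X_txt")) {
  docs <- read_folder("data_X_txt")
}
```

Each element of docs would then be the lines of one company's document, ready to be searched with a regex.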
sub("^(?:https?:\\/\\/)?[^\\/]+\\/([^\\/]+).*$", "\\1", tmp[,5])
I have difficulty adapting the pattern argument of sub (the first argument, I assume) to my problem. I also don't know what the second argument means. I know that the third argument is the source of the text, but I don't know what [,5] means.
Here is the code in full:
tmp <- read.csv("LaVanguardia_facebook_statuses.csv")
sub("^(?:https?:\\/\\/)?[^\\/]+\\/([^\\/]+).*$", "\\1", tmp[,5])
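Since the CSV is not available here, the call can be reproduced on a small made-up data frame. The column names and URLs below are invented for illustration; the only assumption carried over from the original code is that the fifth column holds URLs.

```r
# Made-up stand-in for the CSV: the fifth column holds URLs.
tmp <- data.frame(a = 1:2, b = 1:2, c = 1:2, d = 1:2,
                  link = c("https://www.example.com/first/page.html",
                           "http://example.org/second/x"),
                  stringsAsFactors = FALSE)

# tmp[, 5] selects all rows of the fifth column and returns it as a vector.
urls <- tmp[, 5]

# "\\1" in the replacement stands for whatever the first parenthesised
# group of the pattern matched; the rest of the match is discarded.
first_segment <- sub("^(?:https?:\\/\\/)?[^\\/]+\\/([^\\/]+).*$", "\\1", urls)
first_segment
```

On this data the call keeps only the first path segment after the host name of each URL.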
Here is also one of the txt files I use: https://www.dropbox.com/s/e24ywni8z3s8wqk/SolarWorldAG_25.03.2008_1.HTML.txt?dl=0
My knowledge of R currently comes from Swirl courses and, specifically for text mining, from this tutorial: https://rstudio-pubs-static.s3.amazonaws.com/31867_8236987cf0a8444e962ccd2aec46d9c3.html