Extracting unknown dates from txt/HTML files using R

Question

I want to extract dates from txt (or HTML) documents in R, using a pattern that I identified in the text with the tm package. I have newspaper articles on my PC in the folders data_X_txt (plain text) and data_X (HTML). Each folder contains one document per company, named after that company, which holds all of its newspaper articles in a single txt or HTML file. I downloaded these documents as HTML from Lexis Nexis.

For each document, I want to know the upload dates of the articles it contains. I found that each article's upload date follows the word UPDATE:.

I found this question, which is similar to my problem: Extract unknown words from a recurrent pattern

But I have several problems getting to a solution.
First, I don't know how to correctly load the data from the individual documents into R for further processing with a regex.

Second, I have trouble understanding and applying the sub function myself. See this call, which I found:

sub("^(?:https?:\\/\\/)?[^\\/]+\\/([^\\/]+).*$", "\\1", tmp[,5])

I have difficulty adapting the pattern argument of sub (the first argument, I assume) to my problem. I also don't know what the second argument means. I know that the third argument is the source of the text, but I don't know what [,5] means.

Here is the code in full:

tmp <- read.csv("LaVanguardia_facebook_statuses.csv")
sub("^(?:https?:\\/\\/)?[^\\/]+\\/([^\\/]+).*$", "\\1", tmp[,5])

Here is also a txt file I use: https://www.dropbox.com/s/e24ywni8z3s8wqk/SolarWorldAG_25.03.2008_1.HTML.txt?dl=0

My knowledge of R currently comes from Swirl courses and, for text mining specifically, from https://rstudio-pubs-static.s3.amazonaws.com/31867_8236987cf0a8444e962ccd2aec46d9c3.html

Please simply show us an excerpt of the input, along with the expected output and the stuff you tried directly in SO. Eg. your Dropbox link isn't working and not considered good practice here on SO — nozzleman, Nov 01 '16 at 14:02

Serban Tanasa · Accepted Answer · 2016-11-02T10:45:00.653

The text mining package will not help much if all you need are the dates, but the regular expression capabilities of R are pretty useful.

To achieve specifically what you asked for, try gregexpr with regmatches:

fileName <- "~/Downloads/SolarWorldAG_25.03.2008_1.HTML.txt"
mytxt <- readChar(fileName, file.info(fileName)$size)
regmatches(mytxt, regexec("UPDATE:",mytxt))

regmatches(mytxt, gregexpr(
"UPDATE: [A-Za-z]{0,10} ?[0-9]{1,2}\\. [A-Z][a-zä]{2,8} [0-9]{4}",
mytxt))

It says, in English: look for the literal UPDATE: followed by a space; then 0 to 10 letters for the optional German day of the week, and an optional space; then a 1- to 2-digit day, a period, and a space; then the month name, written as one capital letter followed by 2 to 8 lowercase letters (a to z, or ä); then a space and a 4-digit year. The period is escaped because an unescaped . matches any character in a regular expression, and the backslash is doubled because an R string literal needs \\ to produce a single backslash.
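To try the pattern without the original file, here is a minimal sketch that runs it on a made-up string. The variable names and the sample text (sample_txt) are assumptions for illustration, not taken from the Lexis Nexis document; the pattern is essentially the one from the answer.

```r
# Hypothetical sample text; it only imitates the "UPDATE:" lines in the articles.
sample_txt <- paste(
  "Some article text UPDATE: 18. März 2008 more text",
  "UPDATE: Montag 3. Januar 2005 and a non-date: UPDATE: soon",
  sep = "\n"
)

# Literal "UPDATE: ", optional weekday, day, period, month name, year.
pattern <- "UPDATE: [A-Za-z]{0,10} ?[0-9]{1,2}\\. [A-Z][a-zä]{2,8} [0-9]{4}"

# gregexpr() finds every match; regmatches() returns a list with one
# element per input string, so [[1]] picks the matches for sample_txt.
hits <- regmatches(sample_txt, gregexpr(pattern, sample_txt))[[1]]
print(hits)
```

This should return the two dated strings (with and without a weekday) and skip "UPDATE: soon", because that text has no day, month and year after it.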

You get:

[1] "UPDATE: 18. März 2008"      "UPDATE: 14. März 2008"     
[3] "UPDATE: 13. März 2008"      "UPDATE: 14. März 2008"     
[5] "UPDATE: 28. Februar 2008"   "UPDATE: 20. Februar 2008" 
...
[189] "UPDATE: 31. Dezember 2004"      "UPDATE: 3. Januar 2005"        
[191] "UPDATE: 9. Dezember 2004"       "UPDATE: 23. November 2004"
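If you need real dates for every document in a folder, the matches can be post-processed. The following is a sketch under stated assumptions, not a tested solution for your data: parse_update_date and extract_update_dates are hypothetical helpers, the folder name data_X_txt is taken from the question, and the month names are mapped by hand because as.Date with %B depends on the current locale.

```r
german_months <- c("Januar", "Februar", "März", "April", "Mai", "Juni",
                   "Juli", "August", "September", "Oktober", "November",
                   "Dezember")

update_pattern <- "UPDATE: [A-Za-z]{0,10} ?[0-9]{1,2}\\. [A-Z][a-zä]{2,8} [0-9]{4}"

# Hypothetical helper: turn "UPDATE: [weekday] 18. März 2008" into a Date.
parse_update_date <- function(x) {
  # Capture day, month name and year at the end of each match.
  parts <- regmatches(x, regexec("([0-9]{1,2})\\. ([^ ]+) ([0-9]{4})$", x))
  iso <- vapply(parts, function(p) {
    if (length(p) != 4) return(NA_character_)   # no date found
    month <- match(p[3], german_months)
    if (is.na(month)) return(NA_character_)     # unknown month name
    sprintf("%s-%02d-%02d", p[4], month, as.integer(p[2]))
  }, character(1))
  as.Date(iso, format = "%Y-%m-%d")
}

# Hypothetical helper: read one file and return all its UPDATE dates.
extract_update_dates <- function(path) {
  txt <- readChar(path, file.info(path)$size)
  hits <- regmatches(txt, gregexpr(update_pattern, txt))[[1]]
  parse_update_date(hits)
}

# Assumed folder name from the question; adjust it to your own path.
files <- list.files("data_X_txt", pattern = "\\.txt$", full.names = TRUE)
dates_by_file <- lapply(files, extract_update_dates)
names(dates_by_file) <- basename(files)
```

dates_by_file is then a list with one Date vector per document, named after the file. If the files are not in your session's encoding, the regex step may fail on the umlauts, so check the encoding first.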

Thanks for the answer. But shouldnt it say "sequence of 2 to 8 letters" in the explanation of the gregexpr() code. Sadly I cant edit that because i can only edit more than 6 letters. — Marvin Schopf, Nov 02 '16 at 07:00
