
Good day

I am a newbie to Stack Overflow :) I am trying my hand at programming in R and have found this platform a great source of help.

I have developed some code with help from Stack Overflow, but now I am failing to read the metadata from this .htm file.

Please download this file directly before using it in R:

setwd("~/NLP")
library(tm)
library(rvest)
library(tm.plugin.factiva)
file <- read_html("facts.htm")
source <- FactivaSource(file)
corpus <- Corpus(source, readerControl = list(language = NA))

# See the contents of the documents
inspect(corpus)

head(corpus)
<<VCorpus>>
Metadata:  corpus specific: 0, document level (indexed): 0
Content:  documents: 3

See the metadata associated with the third article:

meta(corpus[[3]])
  author       : character(0)
  datetimestamp: 2017-08-31
  description  : character(0)
  heading      : Rain, Rain, Rain
  id           : TIMEUK-170830-e
  language     : en
  origin       : thetimes.co.uk
  edition      : character(0)
  section      : Comment
  subject      : c("Hurricanes/Typhoons", "Storms", "Political/General News", "Disasters/Accidents", "Natural Disasters/Catastrophes", "Risk News", "Weather")
  coverage     : c("United States", "North America")
  company      : character(0)
  industry     : character(0)
  infocode     : character(0)
  infodesc     : character(0)
  wordcount    : 333
  publisher    : News UK & Ireland Limited
  rights       : © Times Newspapers Limited 2017

How can I save each metadata element (SE, HD, AU, ..PUB, AU) - all 18 metadata elements - column-wise in a data frame, or write them to Excel, for each document in the corpus?

Example of output:

      SE HD AU ...
Doc 1
Doc 2
Doc 3

Thank you for your help

seg data
  • Welcome to StackOverflow! Please read the info about [how to ask a good question](http://stackoverflow.com/help/how-to-ask) and how to give a [reproducible example](http://stackoverflow.com/questions/5963269). This will make it much easier for others to help you. – Sotos Aug 31 '17 at 11:57
  • Take `head(corpus)` and show us the data. If you would like help sorting the data into your groupings we need to know what it looks like. – sconfluentus Aug 31 '17 at 12:06
  • @sconfluentus thank you for your advice - I have added this to the above – seg data Aug 31 '17 at 12:12
  • @Sotos - I hope my changes are to your satisfaction – seg data Aug 31 '17 at 12:12

1 Answer

The simplest way I know of to do it is:

Make a data frame from the metadata of each of the three documents in your corpus:

one   <- data.frame(unlist(meta(corpus[[1]])))
two   <- data.frame(unlist(meta(corpus[[2]])))
three <- data.frame(unlist(meta(corpus[[3]])))

Then you will want to merge them into a single data frame. For the first two this is easy: merging by "row.names" joins them on the (non-variable) row names. The second merge, however, has to join on the column now named "Row.names", so you first need to turn the third data frame's row names into its first column and rename that column. Using setDT() from the data.table package lets you do this without adding another full copy of the information; it simply redirects R to see the row names as the first column:

library(data.table)

setDT(three, keep.rownames = TRUE)[]
colnames(three)[1] <- "Row.names"

Then simply merge the first and second data frames into a variable named meta, and merge meta with three by "Row.names" (the new name of the first column):

meta <- merge(one, two, by="row.names", all=TRUE) 
meta <- merge(meta, three, by = "Row.names", all=TRUE)

Your data will look like this:

  Row.names unlist.meta.corpus..1.... unlist.meta.corpus..2.... unlist.meta.corpus..3....
1    author             Jenni Russell                      <NA>                      <NA>
2 coverage1             United States               North Korea             United States
3 coverage2             North America             United States             North America
4 coverage3                      <NA>                     Japan                      <NA>
5 coverage4                      <NA>                 Pyongyang                      <NA>
6 coverage5                      <NA>              Asia Pacific                      <NA> 

Those NA values are there because not all of the sub-lists had values for all of the observations.

By using the all=TRUE on both merges, you preserve all of the fields, with and without data, which makes it easy to work with moving forward.
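As a side note, the per-document data frames and merges can also be collapsed into one pass. This is only a hedged sketch (assuming the same corpus object as above, and that every document's metadata flattens cleanly with unlist): flatten each document's metadata to a named vector, take the union of field names, and bind one column per document, which yields the same field-by-document layout with NA for missing values.

```r
# Flatten each document's metadata into a named character vector
meta_list <- lapply(corpus, function(doc) unlist(meta(doc)))

# Union of all field names across documents (keeps fields missing from some docs)
all_fields <- unique(unlist(lapply(meta_list, names)))

# One column per document, indexed by the shared field names (missing -> NA)
meta_df <- data.frame(
  Row.names = all_fields,
  sapply(seq_along(meta_list), function(i) meta_list[[i]][all_fields]),
  stringsAsFactors = FALSE
)
```

This avoids the setDT() step entirely, at the cost of being a little less explicit about what each merge is doing.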

If you look at this PDF from CRAN, the Details section on page two shows you how to access the content and metadata. From there it is simply a matter of unlisting to move them into data frames.

If you get lost, send a comment and I will do what I can to help you out!

EDIT BY REQUEST:

To write this to Excel is not very difficult, because the data is already "square" in a uniform data frame. Just install the xlsx and xlsxjars packages, then use the following function:

library(xlsx)

# "meta.xlsx" is an example file name
write.xlsx(meta, "meta.xlsx", sheetName = "Sheet1",
           col.names = TRUE, row.names = TRUE, append = FALSE, showNA = TRUE)

You can find information about the package here: page 38 gives more detail. And if you want to save the content as well, change meta() to content() in the lines that extract the data from the corpus and build the initial data frames; the process is otherwise identical.
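To make that concrete, here is a sketch of saving the article text alongside the metadata, assuming the corpus and meta objects from above (the workbook name "corpus.xlsx" is just an example):

```r
library(xlsx)

# Collapse each document's text into one string per row
texts <- data.frame(
  doc  = seq_along(corpus),
  text = sapply(corpus, function(doc) paste(content(doc), collapse = " ")),
  stringsAsFactors = FALSE
)

# Metadata on one sheet, full text on another (append = TRUE keeps both)
write.xlsx(meta,  "corpus.xlsx", sheetName = "metadata", row.names = TRUE)
write.xlsx(texts, "corpus.xlsx", sheetName = "content", row.names = FALSE, append = TRUE)
```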

sconfluentus
  • Thanks @sconfluentus - this is perfect. How can I also save the content of the articles in the excel file? Metadata in the htm is listed as TD. – seg data Sep 01 '17 at 06:14
  • Thank you @sconfluentus - your note to change 'meta' to 'content' is very useful – seg data Sep 04 '17 at 04:19