I have an Excel file containing a large number of hyperlinks, and I want to write a program that extracts the URLs and matches them with the displayed text. I can import the URLs with the solution to a previous question, which uses the following code:
library(XML)
# rename file to .zip
# my.excel.file is the path to the workbook (e.g. "data.xlsx")
my.zip.file <- sub("\\.xlsx$", ".zip", my.excel.file)
file.copy(from = my.excel.file, to = my.zip.file)
# unzip the file
unzip(my.zip.file)
# unzipping produces a bunch of files which we can read using the XML package
# assume sheet1 has our data
xml <- xmlParse("xl/worksheets/sheet1.xml")
# finally grab the hyperlinks
hyperlinks <- xpathApply(xml, "//x:hyperlink/@display", namespaces="x")
However, this ignores rows without any links, so the imported dataset is several thousand rows shorter than it should be. I can get the displayed text with read.xlsx, but I don't know how to match it up with the URLs. I've tried to find out which rows actually contain links, and to change the code so that it inserts NAs in the right places, but without success.
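One approach I've been considering, sketched below under the assumption that each x:hyperlink element also carries a ref attribute with the cell address (e.g. "B2"): pull the refs alongside the display values, convert them to row numbers, and use those to scatter the URLs into a full-length column with NAs elsewhere. The refs, urls, and display vectors here are stand-ins for what xpathSApply and read.xlsx would return, and the -1 offset assumes row 1 is a header:

```r
# Stand-ins for values extracted from sheet1.xml / read.xlsx:
refs    <- c("B2", "B5", "B6")               # cells that contain hyperlinks
urls    <- c("http://a.example", "http://b.example", "http://c.example")
display <- c("Alpha", "Beta", "Gamma", "Delta", "Epsilon")  # all rows' text

# convert cell refs like "B2" to row numbers by stripping the column letters
link_rows <- as.integer(gsub("[A-Za-z]", "", refs))

# build a URL column the same length as the data, NA where there is no link
url_col <- rep(NA_character_, length(display))
url_col[link_rows - 1] <- urls               # -1 because row 1 is the header

result <- data.frame(display = display, url = url_col,
                     stringsAsFactors = FALSE)
```

Is something along these lines workable, or is there a cleaner way to get the row positions of the hyperlinks?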