I have an Excel file containing a large number of hyperlinks, and I want to write a program that extracts the URLs and matches them with the displayed text. I can import the URLs with the solution to a previous question, which uses the following code:
library(XML)
# rename file to .zip
# my.excel.file is the path to the workbook (e.g. "data.xlsx")
my.zip.file <- sub("\\.xlsx$", ".zip", my.excel.file)
file.copy(from = my.excel.file, to = my.zip.file)
# unzip the file
unzip(my.zip.file)
# unzipping produces a bunch of files which we can read using the XML package
# assume sheet1 has our data
xml <- xmlParse("xl/worksheets/sheet1.xml")
# finally grab the hyperlinks
hyperlinks <- xpathApply(xml, "//x:hyperlink/@display", namespaces="x")
However, this ignores rows without any links, so the imported dataset is several thousand rows shorter than it should be. I can get the displayed text with read.xlsx, but I don't know how to match it up with the URLs. I've tried to find out which rows actually contain links, and to change the code so that it inserts NAs in the right places, but without success.
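One approach I've been considering, sketched below under the assumption that each x:hyperlink element also carries a ref attribute with the cell address (e.g. "B2"): pull the refs alongside the display values, convert them to row numbers, and use those to scatter the URLs into a full-length column with NAs elsewhere. The refs, urls, and display vectors here are stand-ins for what xpathSApply and read.xlsx would return, and the -1 offset assumes row 1 is a header:

```r
# Stand-ins for values extracted from sheet1.xml / read.xlsx:
refs    <- c("B2", "B5", "B6")               # cells that contain hyperlinks
urls    <- c("http://a.example", "http://b.example", "http://c.example")
display <- c("Alpha", "Beta", "Gamma", "Delta", "Epsilon")  # all rows' text

# convert cell refs like "B2" to row numbers by stripping the column letters
link_rows <- as.integer(gsub("[A-Za-z]", "", refs))

# build a URL column the same length as the data, NA where there is no link
url_col <- rep(NA_character_, length(display))
url_col[link_rows - 1] <- urls               # -1 because row 1 is the header

result <- data.frame(display = display, url = url_col,
                     stringsAsFactors = FALSE)
```

Is something along these lines workable, or is there a cleaner way to get the row positions of the hyperlinks?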