How do I take a cell in Excel, which has text that is hyperlinked, and extract the hyperlink part?
-
Do you just need to identify if a cell contains a hyperlink? – hrbrmstr Jun 10 '14 at 20:07
-
1Can you read in the file and use gsub() or a variant to locate and extract the hyperlinks? – lawyeR Jun 10 '14 at 20:14
-
With `xlsx`, once you have something akin to a `cells` variable after a call to `getCells(…)`, then you can do a `getCallValue` setting `encoding` to `"unknown"` and then use various other R code to test if it's a hyperlink. – hrbrmstr Jun 10 '14 at 20:21
-
I submitted this question as I was running out the door. Perhaps I could have done a better job. I know which column contains hyperlinks. When I read the Excel file, all I get is the text, but I need to extract the address/URL. – rrs Jun 10 '14 at 20:40
-
You can use the regex from [this SO thread](http://stackoverflow.com/a/163684/1457051) to find and extract pretty much any URI/URL. – hrbrmstr Jun 10 '14 at 20:50
-
what do you mean by "I need to extract the address"? How is that different from getting the text? Are you looking for `browseURL`, `download.file`, or `RCurl::getURL`? – GSee Jun 10 '14 at 22:32
-
@GSee no, the Excel file contains hyperlinks in one of the columns, meaning there's text that you can click on which will take you to a file or URL. I need to extract this address in R, but I'm only getting whatever is visible when you open the file in Excel. – rrs Jun 10 '14 at 22:43
-
This just bit me. With @hrbrmstr 's comments, I was able to come up with a solution that needs paring down, but is a complete reprex. It is here: https://stackoverflow.com/q/70013136/1022967 – mpettis Nov 18 '21 at 00:25
2 Answers
I found a super convoluted way to extract the hyperlinks:
library(XML)
# rename file to .zip
my.zip.file <- sub("xlsx", "zip", my.excel.file)
file.copy(from = my.excel.file, to = my.zip.file)
# unzip the file
unzip(my.zip.file)
# unzipping produces a bunch of files which we can read using the XML package
# assume sheet1 has our data
xml <- xmlParse("xl/worksheets/sheet1.xml")
# finally grab the hyperlinks
hyperlinks <- xpathApply(xml, "//x:hyperlink/@display", namespaces="x")
Derived from this blogpost.

- 9,615
- 4
- 28
- 38
-
1(a) great find; (b) not so convoluted. [RExcelXML](http://www.omegahat.org/RExcelXML/) does something similar. Don't forget to mark your own question as answered! – hrbrmstr Jun 11 '14 at 00:37
-
2It's clear from your comments what packages you're using, but you should still include the appropriate preamble (*i.e.* `library(XML)`) in your answer. – Hugh Jun 11 '14 at 01:48
-
-
1my.excel.file = "your_excel_file.xlsx" at that moment he hasn´t read the excel file yet, just created a character vector with the same name of his excel file, but with .zip as the file extension. Then he creates a copy of its excel file but to the excel file wit the .zip extension, then he unzips it and then he reads sheet1 as an xml with xml <- xmlParse("xl/worksheets/sheet1.xml") – csmontt Jan 19 '18 at 21:04
Inspired by @rrs's post above, I put together a function to extract all hyperlinks in a workbook with reference (sheet!A1 format), with which you can look up the hyperlink in any cell.
**Some notes explain the snippet below: **
-- Hyperlinks are stored in two files after unzipping the excel file, (a) xl/worksheets/sheet1.xml and xl/worksheets/_rels/sheet1.xml.res. While the xml one has cell position to rId (relationship id?) table, the xml.res one has the actually rId and hyperlinks table
-- embedded function read_relationships parse XML and join them
-- temp_base_dir is used host the file operation
-- map_df stack hyperlinks from all tabs together
-- Output dataframe has 5 columns. i.e. id (relationship id), target (the hyperlink), ref (cell reference in sheet!A1 format), tab_idx (sheet index), tab (sheet name)
Snippet:
library(tidyverse)
library(XML)
extract_hyperlinks_from_excel <- function(aExcelFile, aRefOutputFile = NULL){
sheets <- readxl::excel_sheets(aExcelFile)
read_relationships <- function(aSheetIndex){
filename <- file.path(tmp_base_dir, 'xl', 'worksheets', '_rels', paste0('sheet', aSheetIndex, '.xml.rels'))
rel <- xmlParse(filename)
rel <- xmlToList(rel)
rel <- purrr::map_dfr(rel, as.list)
rel <- rel[, c('Id', 'Target')]
names(rel) <- c('id', 'target')
if(nrow(rel) == 0){
return(NULL)
}
filename <- file.path(tmp_base_dir, 'xl', 'worksheets', paste0('sheet', aSheetIndex, '.xml'))
pos <- xmlParse(filename)
pos <- xmlToList(pos)
if(is.null(pos$hyperlinks)){
return(NULL)
}
pos <- purrr::map_dfr(pos$hyperlinks, as.list)
pos <- pos[, c('ref', 'id')]
ret <- inner_join(rel, pos, by = 'id')
ret$tab_idx <- aSheetIndex
return(ret)
}
EXCEL_TEMP_NAME <- 'unzipped_excel'
tmp_base_dir <- file.path(tempdir(),
paste0('tmpexcl',
as.character(round(runif(1, 1000000000000, 9999999999999)))))
dir.create(tmp_base_dir)
on.exit(unlink(tmp_base_dir))
zipfile <- file.path(tmp_base_dir, paste0(EXCEL_TEMP_NAME, '.zip'))
file.copy(from = aExcelFile, to = zipfile)
unzip(zipfile, exdir = tmp_base_dir)
ret <- map_df(seq_along(sheets), read_relationships)
ret %>%
mutate(tab = sheets[tab_idx]) %>%
mutate(ref = paste0("'", tab, "'!", ref)) %>%
select(id,tab_idx, tab, ref, target) ->
ret
if(!is.null(aRefOutputFile)){
write_csv(ret, aRefOutputFile)
}
return(ret)
}

- 48
- 6
-
-
I tried this with [my own use case](https://stackoverflow.com/questions/62506660) and the map_df function returned `Error: XML content does not seem to be XML: '' `. – NotReallyHere12 Jan 13 '22 at 19:47
-
What is the version of your Excel file. The code works on XLSX format. – Sean Yang Mar 02 '22 at 15:34
-
-
@NotReallyHere12. It seems the error is related to either hidden sheet in Excel or temporary folder. I updated the code. Hope it works on your ends. – Sean Yang Mar 10 '22 at 16:56
-
-
@AuronusBen: XLS is binary file which can not be parsed, whereas XLSX is XML/zip which is text based. – Sean Yang Nov 18 '22 at 14:15
-
@SeanYang Thanks for this clarification. So there's no way to extract hyperlinks from a XLS file? – AnonX Nov 21 '22 at 08:12