Extract hyperlink from Excel file in R

Question

How do I take a cell in Excel, which has text that is hyperlinked, and extract the hyperlink part?

Do you just need to identify if a cell contains a hyperlink? — hrbrmstr, Jun 10 '14 at 20:07
Can you read in the file and use gsub() or a variant to locate and extract the hyperlinks? — lawyeR, Jun 10 '14 at 20:14
With `xlsx`, once you have something akin to a `cells` variable after a call to `getCells(…)`, then you can do a `getCallValue` setting `encoding` to `"unknown"` and then use various other R code to test if it's a hyperlink. — hrbrmstr, Jun 10 '14 at 20:21
I submitted this question as I was running out the door. Perhaps I could have done a better job. I know which column contains hyperlinks. When I read the Excel file, all I get is the text, but I need to extract the address/URL. — rrs, Jun 10 '14 at 20:40
You can use the regex from [this SO thread](http://stackoverflow.com/a/163684/1457051) to find and extract pretty much any URI/URL. — hrbrmstr, Jun 10 '14 at 20:50
what do you mean by "I need to extract the address"? How is that different from getting the text? Are you looking for `browseURL`, `download.file`, or `RCurl::getURL`? — GSee, Jun 10 '14 at 22:32
@GSee no, the Excel file contains hyperlinks in one of the columns, meaning there's text that you can click on which will take you to a file or URL. I need to extract this address in R, but I'm only getting whatever is visible when you open the file in Excel. — rrs, Jun 10 '14 at 22:43
This just bit me. With @hrbrmstr 's comments, I was able to come up with a solution that needs paring down, but is a complete reprex. It is here: https://stackoverflow.com/q/70013136/1022967 — mpettis, Nov 18 '21 at 00:25

rrs · Accepted Answer · 2014-06-11T02:09:47.087

12

I found a super convoluted way to extract the hyperlinks:

library(XML)

# rename file to .zip
my.zip.file <- sub("xlsx", "zip", my.excel.file)
file.copy(from = my.excel.file, to = my.zip.file)

# unzip the file
unzip(my.zip.file)

# unzipping produces a bunch of files which we can read using the XML package
# assume sheet1 has our data
xml <- xmlParse("xl/worksheets/sheet1.xml")

# finally grab the hyperlinks
hyperlinks <- xpathApply(xml, "//x:hyperlink/@display", namespaces="x")

Derived from this blogpost.

edited Jun 11 '14 at 02:09

answered Jun 10 '14 at 22:47

rrs

9,615
4
28
38

1

(a) great find; (b) not so convoluted. [RExcelXML](http://www.omegahat.org/RExcelXML/) does something similar. Don't forget to mark your own question as answered! – hrbrmstr Jun 11 '14 at 00:37
2

It's clear from your comments what packages you're using, but you should still include the appropriate preamble (*i.e.* `library(XML)`) in your answer. – Hugh Jun 11 '14 at 01:48
How did you read your excel file? – pissall Dec 13 '17 at 08:54
1

my.excel.file = "your_excel_file.xlsx" at that moment he hasn´t read the excel file yet, just created a character vector with the same name of his excel file, but with .zip as the file extension. Then he creates a copy of its excel file but to the excel file wit the .zip extension, then he unzips it and then he reads sheet1 as an xml with xml <- xmlParse("xl/worksheets/sheet1.xml") – csmontt Jan 19 '18 at 21:04

Sean Yang · Answer 2 · 2022-03-10T16:53:41.240

Inspired by @rrs's post above, I put together a function to extract all hyperlinks in a workbook with reference (sheet!A1 format), with which you can look up the hyperlink in any cell.

**Some notes explain the snippet below: **

-- Hyperlinks are stored in two files after unzipping the excel file, (a) xl/worksheets/sheet1.xml and xl/worksheets/_rels/sheet1.xml.res. While the xml one has cell position to rId (relationship id?) table, the xml.res one has the actually rId and hyperlinks table

-- embedded function read_relationships parse XML and join them

-- temp_base_dir is used host the file operation

-- map_df stack hyperlinks from all tabs together

-- Output dataframe has 5 columns. i.e. id (relationship id), target (the hyperlink), ref (cell reference in sheet!A1 format), tab_idx (sheet index), tab (sheet name)

Snippet:

library(tidyverse)
library(XML)
extract_hyperlinks_from_excel <- function(aExcelFile, aRefOutputFile = NULL){
  sheets <- readxl::excel_sheets(aExcelFile)
  
  read_relationships <- function(aSheetIndex){
    filename <- file.path(tmp_base_dir, 'xl', 'worksheets', '_rels', paste0('sheet', aSheetIndex, '.xml.rels'))    
    rel <- xmlParse(filename)
    rel <- xmlToList(rel)
    rel <- purrr::map_dfr(rel, as.list)
    rel <- rel[, c('Id', 'Target')]
    names(rel) <- c('id', 'target')
    if(nrow(rel) == 0){
      return(NULL)
    }
    
    filename <- file.path(tmp_base_dir, 'xl', 'worksheets', paste0('sheet', aSheetIndex, '.xml'))    
    pos <- xmlParse(filename)
    pos <- xmlToList(pos)
    if(is.null(pos$hyperlinks)){
      return(NULL)
    }
    
    pos <- purrr::map_dfr(pos$hyperlinks, as.list)
    pos <- pos[, c('ref', 'id')]
    
    ret <- inner_join(rel, pos, by = 'id')
    ret$tab_idx <- aSheetIndex
    
    return(ret)
  }
  
  EXCEL_TEMP_NAME <- 'unzipped_excel'
  
  tmp_base_dir <- file.path(tempdir(),
                            paste0('tmpexcl', 
                                   as.character(round(runif(1, 1000000000000, 9999999999999)))))
  dir.create(tmp_base_dir)
  on.exit(unlink(tmp_base_dir))
  
  zipfile <- file.path(tmp_base_dir, paste0(EXCEL_TEMP_NAME, '.zip'))
  file.copy(from = aExcelFile, to = zipfile)
  unzip(zipfile, exdir = tmp_base_dir)
  
  ret <- map_df(seq_along(sheets), read_relationships)
  
  ret %>% 
    mutate(tab = sheets[tab_idx]) %>% 
    mutate(ref = paste0("'", tab, "'!", ref)) %>% 
    select(id,tab_idx, tab, ref, target)  ->
    ret

  if(!is.null(aRefOutputFile)){
    write_csv(ret, aRefOutputFile)
  }
  
  return(ret)
}

I tried this with [my own use case](https://stackoverflow.com/questions/62506660) and the map_df function returned `Error: XML content does not seem to be XML: '' `. — NotReallyHere12, Jan 13 '22 at 19:47
What is the version of your Excel file. The code works on XLSX format. — Sean Yang, Mar 02 '22 at 15:34
@NotReallyHere12. It seems the error is related to either hidden sheet in Excel or temporary folder. I updated the code. Hope it works on your ends. — Sean Yang, Mar 10 '22 at 16:56
Hello @SeanYang, is there a way to make it work with "XLS" files too? — AnonX, Nov 17 '22 at 17:14
@AuronusBen: XLS is binary file which can not be parsed, whereas XLSX is XML/zip which is text based. — Sean Yang, Nov 18 '22 at 14:15
@SeanYang Thanks for this clarification. So there's no way to extract hyperlinks from a XLS file? — AnonX, Nov 21 '22 at 08:12

Extract hyperlink from Excel file in R

2 Answers2

Linked