10

How do I take a cell in Excel, which has text that is hyperlinked, and extract the hyperlink part?

mpettis
  • 3,222
  • 4
  • 28
  • 35
rrs
  • 9,615
  • 4
  • 28
  • 38
  • Do you just need to identify if a cell contains a hyperlink? – hrbrmstr Jun 10 '14 at 20:07
  • 1
    Can you read in the file and use gsub() or a variant to locate and extract the hyperlinks? – lawyeR Jun 10 '14 at 20:14
  • With `xlsx`, once you have something akin to a `cells` variable after a call to `getCells(…)`, then you can do a `getCallValue` setting `encoding` to `"unknown"` and then use various other R code to test if it's a hyperlink. – hrbrmstr Jun 10 '14 at 20:21
  • I submitted this question as I was running out the door. Perhaps I could have done a better job. I know which column contains hyperlinks. When I read the Excel file, all I get is the text, but I need to extract the address/URL. – rrs Jun 10 '14 at 20:40
  • You can use the regex from [this SO thread](http://stackoverflow.com/a/163684/1457051) to find and extract pretty much any URI/URL. – hrbrmstr Jun 10 '14 at 20:50
  • what do you mean by "I need to extract the address"? How is that different from getting the text? Are you looking for `browseURL`, `download.file`, or `RCurl::getURL`? – GSee Jun 10 '14 at 22:32
  • @GSee no, the Excel file contains hyperlinks in one of the columns, meaning there's text that you can click on which will take you to a file or URL. I need to extract this address in R, but I'm only getting whatever is visible when you open the file in Excel. – rrs Jun 10 '14 at 22:43
  • This just bit me. With @hrbrmstr 's comments, I was able to come up with a solution that needs paring down, but is a complete reprex. It is here: https://stackoverflow.com/q/70013136/1022967 – mpettis Nov 18 '21 at 00:25

2 Answers2

12

I found a super convoluted way to extract the hyperlinks:

library(XML)

# rename file to .zip
my.zip.file <- sub("xlsx", "zip", my.excel.file)
file.copy(from = my.excel.file, to = my.zip.file)

# unzip the file
unzip(my.zip.file)

# unzipping produces a bunch of files which we can read using the XML package
# assume sheet1 has our data
xml <- xmlParse("xl/worksheets/sheet1.xml")

# finally grab the hyperlinks
hyperlinks <- xpathApply(xml, "//x:hyperlink/@display", namespaces="x")

Derived from this blogpost.

rrs
  • 9,615
  • 4
  • 28
  • 38
  • 1
    (a) great find; (b) not so convoluted. [RExcelXML](http://www.omegahat.org/RExcelXML/) does something similar. Don't forget to mark your own question as answered! – hrbrmstr Jun 11 '14 at 00:37
  • 2
    It's clear from your comments what packages you're using, but you should still include the appropriate preamble (*i.e.* `library(XML)`) in your answer. – Hugh Jun 11 '14 at 01:48
  • How did you read your excel file? – pissall Dec 13 '17 at 08:54
  • 1
    my.excel.file = "your_excel_file.xlsx" at that moment he hasn´t read the excel file yet, just created a character vector with the same name of his excel file, but with .zip as the file extension. Then he creates a copy of its excel file but to the excel file wit the .zip extension, then he unzips it and then he reads sheet1 as an xml with xml <- xmlParse("xl/worksheets/sheet1.xml") – csmontt Jan 19 '18 at 21:04
0

Inspired by @rrs's post above, I put together a function to extract all hyperlinks in a workbook with reference (sheet!A1 format), with which you can look up the hyperlink in any cell.

**Some notes explain the snippet below: **

-- Hyperlinks are stored in two files after unzipping the excel file, (a) xl/worksheets/sheet1.xml and xl/worksheets/_rels/sheet1.xml.res. While the xml one has cell position to rId (relationship id?) table, the xml.res one has the actually rId and hyperlinks table

-- embedded function read_relationships parse XML and join them

-- temp_base_dir is used host the file operation

-- map_df stack hyperlinks from all tabs together

-- Output dataframe has 5 columns. i.e. id (relationship id), target (the hyperlink), ref (cell reference in sheet!A1 format), tab_idx (sheet index), tab (sheet name)

Snippet:

library(tidyverse)
library(XML)
extract_hyperlinks_from_excel <- function(aExcelFile, aRefOutputFile = NULL){
  sheets <- readxl::excel_sheets(aExcelFile)
  
  read_relationships <- function(aSheetIndex){
    filename <- file.path(tmp_base_dir, 'xl', 'worksheets', '_rels', paste0('sheet', aSheetIndex, '.xml.rels'))    
    rel <- xmlParse(filename)
    rel <- xmlToList(rel)
    rel <- purrr::map_dfr(rel, as.list)
    rel <- rel[, c('Id', 'Target')]
    names(rel) <- c('id', 'target')
    if(nrow(rel) == 0){
      return(NULL)
    }
    
    filename <- file.path(tmp_base_dir, 'xl', 'worksheets', paste0('sheet', aSheetIndex, '.xml'))    
    pos <- xmlParse(filename)
    pos <- xmlToList(pos)
    if(is.null(pos$hyperlinks)){
      return(NULL)
    }
    
    pos <- purrr::map_dfr(pos$hyperlinks, as.list)
    pos <- pos[, c('ref', 'id')]
    
    ret <- inner_join(rel, pos, by = 'id')
    ret$tab_idx <- aSheetIndex
    
    return(ret)
  }
  
  EXCEL_TEMP_NAME <- 'unzipped_excel'
  
  tmp_base_dir <- file.path(tempdir(),
                            paste0('tmpexcl', 
                                   as.character(round(runif(1, 1000000000000, 9999999999999)))))
  dir.create(tmp_base_dir)
  on.exit(unlink(tmp_base_dir))
  
  zipfile <- file.path(tmp_base_dir, paste0(EXCEL_TEMP_NAME, '.zip'))
  file.copy(from = aExcelFile, to = zipfile)
  unzip(zipfile, exdir = tmp_base_dir)
  
  ret <- map_df(seq_along(sheets), read_relationships)
  
  ret %>% 
    mutate(tab = sheets[tab_idx]) %>% 
    mutate(ref = paste0("'", tab, "'!", ref)) %>% 
    select(id,tab_idx, tab, ref, target)  ->
    ret

  if(!is.null(aRefOutputFile)){
    write_csv(ret, aRefOutputFile)
  }
  
  return(ret)
}
Sean Yang
  • 48
  • 6