16

I have an Excel spreadsheet that I am reading into a Pandas DataFrame:

df = pd.read_excel("file.xls")

However, one of the columns of the spreadsheet contains text which have a hyperlink associated with it. How do I access the underlying hyperlink in Pandas?

slaw
  • 6,591
  • 16
  • 56
  • 109
  • I'm not sure what you're looking for or how a sample row would help. Imagine an Excel spreadsheet with only a single cell with text. And that text is a clickable hyperlink in Excel. The text could be the letter "A" and be clickable in Excel but `pd.read_excel` only grabs the letter "A" and not the underlying hyperlink. – slaw Feb 10 '16 at 21:04
  • [xlsx_cells](https://pyjanitor-devs.github.io/pyjanitor/api/io/#janitor.io.xlsx_cells) might be helpful here – sammywemmy May 03 '23 at 09:55

4 Answers4

15

This can be done with openpyxl, I'm not sure its possible with Pandas at all. Here's how I've done it:

import openpyxl

wb = openpyxl.load_workbook('yourfile.xlsm')
sheets = wb.sheetnames
ws = wb[sheets[0]]
# Deprecation warning
# ws = wb.get_sheet_by_name('Sheet1')
print(ws.cell(row=2, column=1).hyperlink.target)

You can also use iPython, and set a variable equal to the hyperlink object:

t = ws.cell(row=2, column=1).hyperlink

then do t. and press tab to see all the options for what you can do with or access from the object.

Anmol Deep
  • 463
  • 1
  • 5
  • 16
wordsforthewise
  • 13,746
  • 5
  • 87
  • 117
3

Quick monkey patching, without converters or anything like this, if you would like to treat ALL cells with hyperlinks as hyperlinks, more sophisticated way, I suppose, at least be able to choose, what columns treat as hyperlinked or gather data, or save somehow both data and hyperlink in same cell at dataframe. And using converters, dunno. (BTW I played also with data_only, keep_links, did not helped, only changing read_only resulted ok, I suppose it can slow down your code speed).

P.S.: Works only with xlsx, i.e., engine is openpyxl

P.P.S.: If you reading this comment in the future and issue https://github.com/pandas-dev/pandas/issues/13439 still Open, don't forget to see changes in _convert_cell and load_workbook at pandas.io.excel._openpyxl and update them accordingly.

import pandas
from pandas.io.excel._openpyxl import OpenpyxlReader
import numpy as np
from pandas._typing import FilePathOrBuffer, Scalar


def _convert_cell(self, cell, convert_float: bool) -> Scalar:
    from openpyxl.cell.cell import TYPE_BOOL, TYPE_ERROR, TYPE_NUMERIC
    # here we adding this hyperlink support:
    if cell.hyperlink and cell.hyperlink.target:
        return cell.hyperlink.target
        # just for example, you able to return both value and hyperlink,
        # comment return above and uncomment return below
        # btw this may hurt you on parsing values, if symbols "|||" in value or hyperlink.
        # return f'{cell.value}|||{cell.hyperlink.target}'
    # here starts original code, except for "if" became "elif"
    elif cell.is_date:
        return cell.value
    elif cell.data_type == TYPE_ERROR:
        return np.nan
    elif cell.data_type == TYPE_BOOL:
        return bool(cell.value)
    elif cell.value is None:
        return ""  # compat with xlrd
    elif cell.data_type == TYPE_NUMERIC:
        # GH5394
        if convert_float:
            val = int(cell.value)
            if val == cell.value:
                return val
        else:
            return float(cell.value)

    return cell.value


def load_workbook(self, filepath_or_buffer: FilePathOrBuffer):
    from openpyxl import load_workbook
    # had to change read_only to False:
    return load_workbook(
        filepath_or_buffer, read_only=False, data_only=True, keep_links=False
    )


OpenpyxlReader._convert_cell = _convert_cell
OpenpyxlReader.load_workbook = load_workbook

And after adding this above in your python file, you will be able to call df = pandas.read_excel(input_file)

After writing all this stuff it came to me, that maybe it would be easier and cleaner just use openpyxl by itself ^_^

Hellohowdododo
  • 396
  • 3
  • 12
  • Updated code, so it would work for pandas 1.4.0, see github issue link: https://github.com/pandas-dev/pandas/issues/13439#issuecomment-1025641177 – Hellohowdododo Jan 31 '22 at 11:35
1

as commented by slaw it doesnt grab the hyperlink but only the text

here text.xlsx contains links in the 9th column

from openpyxl import load_workbook
workbook = load_workbook('test.xlsx')
worksheet = workbook.active

column_indices = [9]

for row in range(2, worksheet.max_row + 1):
    for col in column_indices:
        filelocation = worksheet.cell(column=col, row=row)  # this is hyperlink
        text = worksheet.cell(column=col + 1, row=row)  # thi is your text 
        worksheet.cell(column=col + 1, row=row).value = '=HYPERLINK("' + filelocation.value + '","' + text.value + '")'

workbook.save('test.xlsx')
matan h
  • 900
  • 1
  • 10
  • 19
  • 1
    That's openpyxl right? And why are you writing the worksheet? I thought we want to read the links out of it? Also, the hyperlink is in the same cell, not the next column over. The text itself has a hyperlink. – wordsforthewise Aug 01 '17 at 19:59
0

You cannot do that in pandas. You can try with other libraries designed to deal with excel files.

Stop harming Monica
  • 12,141
  • 1
  • 36
  • 56