5

I have a large excel file with several sheets of data that I need to convert to HTML. I was excited to try Pandas to help streamline the conversion and keep from saving the Excel sheets as HTML and then spending all day removing all the horrific MS tags.

I was able to read the Excel file + sheets and then load them as a dataframe. The only problem is that it is stripping all the hyperlinks out of the cells. I've looked all over but cannot find an answer on retaining hyperlinks. This is the first time I've used Pandas, so it could simply be inexperience. Below is my code and a screenshot of the output. Thank you for your help.

In [2]: import pandas as pd

In [3]: xls_file = pd.ExcelFile('Desktop/cfec-temp/blackbook/blackbook.xlsx')

In [4]: xls_file
Out[4]: <pandas.io.excel.ExcelFile at 0x1132ce4e0>

In [5]: xls_file.sheet_names
Out[5]: ['Sheet1', 'Sheet2', 'Sheet3', 'Sheet4', 'Sheet5', 'Sheet6', 'Sheet7']

In [6]: df = xls_file.parse('Sheet1')

In [7]: df

enter image description here

Charles Smith
  • 3,201
  • 4
  • 36
  • 80
  • This question was asked here [http://stackoverflow.com/questions/35325799/pandas-read-excel-with-hyperlink](http://stackoverflow.com/questions/35325799/pandas-read-excel-with-hyperlink) – omadison Sep 12 '16 at 21:00
  • 1
    Practically all data objects (database tables, pandas dfs, r dfs, sas datasets, etc.) primarily handles textual or numeric data types. In pandas, there is no dtype for hyperlinks, images, and any other special binary objects of Excel. To retain links you should add a column that [converts the hyperlink address](http://superuser.com/questions/430260/excel-convert-a-column-with-hyperlinks-to-just-display-the-links) to a new column. – Parfait Sep 13 '16 at 02:01
  • Would be great if Pandas could handle it, but this is a suitable and non-complicated compromise. Thank you. – Charles Smith Sep 13 '16 at 04:30

1 Answers1

3

What I would do is use openpyxl to get the hyperlinks, then iterate through the rows in a for loop, creating a list of the hyperlinks, and add a new column in your pandas dataframe:

import openpxyl
import pandas as pd

df = pd.read_excel('file.xlsm')

wb = openpyxl.load_workbook('yourfile.xlsm')
ws = wb.get_sheet_by_name('Sheet1')
print(ws.cell(row=2, column=1).hyperlink.target)

links = []
for i in range(2, ws.max_row + 1):  # 2nd arg in range() not inclusive, so add 1
    links.append(ws.cell(row=i, column=1).hyperlink.target)

df['link'] = links

Same first bit of code as I answered in the other question

wordsforthewise
  • 13,746
  • 5
  • 87
  • 117
  • 2
    typo: import openpyxl AND wb.get_sheet_by_name is depreciated Use wb[sheetname] – CJD Nov 15 '18 at 13:10