2

I have got .xls file with images in cells, like so: Excel rows

When I loaded this file in pandas
>>> import pandas as pd
>>> df = pd.read_excel('myfile.xls') # same behaviour with *.xlsx
>>> df.dtypes
The dtype in all columns appeared as object

After some manipulations I saved the df back to excel, however the images disappeared.
Please note that, in excel, I was able to sort the rows simultaneously with the images, and by resizing cells, images scaled accordingly so it looks like they were really contained in the cells.

Why did they disappear after saving df back to excel, or didnt they load into the df in the first place?

redacted
  • 3,789
  • 6
  • 25
  • 38
  • `pandas` uses `xlrd` to read excel files, from this [answer](http://stackoverflow.com/questions/13448508/python-parsing-xls-with-images) sounds like the images are ignored – chrisb Sep 20 '15 at 15:19
  • When I read an excel file with images into pandas DataFrame, the cells with the images are always NaN. An image could be loaded separately as an object of some sort and then inserted into a DataFrame. A way to do this is with PIL.Image.open("image.png"). Usually images would be processed with libraries other than pandas. –  Sep 20 '15 at 23:42

1 Answers1

1

I'm not sure if this will be helpful, but I had the problem where I needed to load a data frame with images, so I wrote the following code. I hope this helps.

import base64
from io import BytesIO

import openpyxl
import pandas as pd

from openpyxl_image_loader import SheetImageLoader

def load_dataframe(dataframe_file_path: str, dataframe_sheet_name: str) -> pd.DataFrame:
    # By default, it appears that pandas does not read images, as it uses only openpyxl to read
    # the file.  As a result we need to load into memory the dataframe and explicitly load in
    # the images, and then convert all of this to HTML and put it back into the normal 
    # dataframe, ready for use.
    pxl_doc = openpyxl.load_workbook(dataframe_file_path)
    pxl_sheet = pxl_doc[dataframe_sheet_name]
    pxl_image_loader = SheetImageLoader(pxl_sheet)
    pd_df = pd.read_excel(dataframe_file_path, sheet_name=dataframe_sheet_name)
    for pd_row_idx, pd_row_data in pd_df.iterrows():
        for pd_column_idx, _pd_cell_data in enumerate(pd_row_data):
            # Offset as openpyxl sheets index by one, and also offset the row index by one more to account for the header row
            pxl_cell_coord_str = pxl_sheet.cell(pd_row_idx + 2, pd_column_idx + 1).coordinate
            if pxl_image_loader.image_in(pxl_cell_coord_str):
            # Now that we have a cell that contains an image, we want to convert it to
            # base64, and it make it nice and HTML, so that it loads in a front end
                pxl_pil_img = pxl_image_loader.get(pxl_cell_coord_str)
                with BytesIO() as pxl_pil_buffered:
                    pxl_pil_img.save(pxl_pil_buffered, format="PNG")
                    pxl_pil_img_b64_str = base64.b64encode(pxl_pil_buffered.getvalue())
                    pd_df.iat[pd_row_idx, pd_column_idx] = '<img src="data:image/png;base64,' + \
                                                                pxl_pil_img_b64_str.decode('utf-8') + \
                                                                f'" alt="{pxl_cell_coord_str}" />'
    return pd_df


NOTE: For some reason, SheetImageLoader's loading of images is persistent globally. This means that when I run this function twice, the second time I run it, openpyxl will append the images from the second run into the SheetImageLoader object of the first.

For example, if I read one file that has 25 images in it, pxl_sheet._images and pxl_image_loader._images both have 25 images in them. However, if I read another file which has 5 images in it, pxl_sheet._images has length 5, but pxl_image_loader._images now has length 30, so it has just appended the new images to the old object, despite being a completely different function call.

I tried deleting the object from memory, but this did not work. I eventually solved this by adding some code in where, after I construct the SheetImageLoader object, I manually reset pxl_image_loader's _images attribute (using a logic similar to that in SheetImageLoader's __init__ method). I'm unsure if this is a bug in openpyxl or something to do with scoping in Python.

Jake Ireland
  • 543
  • 4
  • 11