
I generated a PowerPoint deck with a utility script in Databricks using Python. I now want to access the file in the notebook kernel, but because the deck contains images, the output shows strange symbols. How do I correct the statement below so that it displays the deck's images?

#access file
dbutils.fs.head('file:/dbfs/user/test.pptx')

Out: 'PK\x03\x04\x14\x00\x00\x00\x08\x00D�lOƯ�g�\x01\x00\x00�\x0c\x00\x00\x13\x00\x00\x00[Content_Types].xml͗�N�0\x10��<E�K\x0e�q�\x175��rb�\x04<�I����-ϴзg�.��R�\n_\x12�3���\'Q4霼�:\x1a�GeM�l��$\x02��B�A���]�\x0e�\x08I�Bjk K&��Iw�s7q�\x11\x17\x1b��!�;\x16\x02�!
num3ri
LaLaTi

2 Answers


How to display a pptx file from databricks?

To display the contents of a pptx file in Databricks, use the code below:

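from pptx import Presentation

# Open the deck through the /dbfs mount so python-pptx sees a local path
prs = Presentation('/dbfs/myfolder/BRK4024.pptx')
for slide in prs.slides:
  for shape in slide.shapes:
    # shape_type identifies the kind of shape (placeholder, auto shape, picture, ...)
    print(shape.shape_type)
    print('----------------')
    # Only shapes with a text frame carry text
    if shape.has_text_frame:
      print(shape.text)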


Note: In the output you will only see shape types such as "PlaceHolders", "AutoShapes", and "Pictures", because python-pptx does not support SmartArt. You would need to insert content manually into placeholders, auto shapes, or pictures, which is extra work to build in Python.

Example: Sample code - add an image in every Powerpoint slide using python-pptx
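
As a side note on the strange symbols in the question: a .pptx file is a ZIP archive, and the leading PK\x03\x04 in the output is the ZIP file signature, so dbutils.fs.head is printing compressed binary data as text. Below is a minimal sketch, using only the standard library, of listing and reading the pictures stored under ppt/media/ inside the archive. The function names list_pptx_media and read_pptx_media are hypothetical helpers, and the demo archive is a made-up stand-in built in memory, not a real deck:

```python
import io
import zipfile


def list_pptx_media(pptx_file):
    """Return the names of the files stored under ppt/media/ in a .pptx archive.

    pptx_file may be a path such as '/dbfs/myfolder/BRK4024.pptx'
    or a binary file-like object.
    """
    with zipfile.ZipFile(pptx_file) as archive:
        return [name for name in archive.namelist()
                if name.startswith("ppt/media/")]


def read_pptx_media(pptx_file, member_name):
    """Return the raw bytes of one embedded file, e.g. a picture."""
    with zipfile.ZipFile(pptx_file) as archive:
        return archive.read(member_name)


# Stand-in archive (assumption: not a real deck) so the sketch runs anywhere.
demo = io.BytesIO()
with zipfile.ZipFile(demo, "w") as archive:
    archive.writestr("[Content_Types].xml", "<Types/>")
    archive.writestr("ppt/slides/slide1.xml", "<sld/>")
    archive.writestr("ppt/media/image1.png", b"\x89PNG fake bytes")

print(list_pptx_media(demo))  # ['ppt/media/image1.png']
```

With a real deck you would pass the /dbfs path instead of the in-memory stand-in. This only recovers the embedded pictures, not rendered slides.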

How to download a pptx file from databricks?

You can use the Databricks CLI to download files from the Databricks file system to your local machine as follows:

dbfs cp dbfs:/myfolder/BRK4024.pptx A:DataSet\

Example: I have a sample BRK4024.pptx file in myfolder on DBFS, so I use the Databricks CLI command above to copy it to a folder on my local machine (A:Dataset).


Hope this helps.

CHEEKATLAPRADEEP

This is an additional answer to the part of the question about how to display a pptx file from Databricks.

Of course, @CHEEKATLAPRADEEP-MSFT has already shown how to use python-pptx to extract the text content of a pptx file and show it in a Databricks notebook.

However, if you want to display the whole slides of a pptx file as images in the Databricks notebook, as the blog post Converting presentation slides to HTML blog post with images did, that is not possible within the Databricks notebook itself, for the reasons below.

  1. Databricks runs on Linux, so you cannot convert a pptx file to images through the win32 API that drives the MS PowerPoint application.
  2. The existing solutions for converting pptx to images require LibreOffice to be installed on the machine, and I'm afraid you cannot do that on the Linux OS behind cloud Databricks. As discussed in https://github.com/scanny/python-pptx/issues/348, python-pptx cannot do the conversion, and I'm not aware of any Python package that can do it alone.

If your Databricks runs on a machine you control, you can try following the SO thread How to convert pptx files to jpg or png (for each slide) on linux? or the code at https://github.com/innaky/pptx-to-images/blob/master/pptx-to-images.py to get an image of each slide of a pptx file. You can then refer to the section Display images of the Databricks document Use Notebooks to display them.
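
The first step of that approach can be sketched roughly as follows, assuming LibreOffice is installed and its soffice executable is on the PATH (which, as noted above, may not be the case on cloud Databricks). The helper names build_convert_command and convert_pptx_to_pdf are hypothetical. The sketch only converts the deck to a PDF; splitting the PDF into one image per slide needs a further tool such as pdftoppm from poppler, which is also an assumption about what is installed:

```python
import shutil
import subprocess
from pathlib import Path


def build_convert_command(pptx_path, out_dir, executable="soffice"):
    """Build the LibreOffice headless command that converts a deck to PDF."""
    return [
        executable,
        "--headless",
        "--convert-to", "pdf",
        "--outdir", str(out_dir),
        str(pptx_path),
    ]


def convert_pptx_to_pdf(pptx_path, out_dir, executable="soffice"):
    """Run the conversion and return the expected path of the resulting PDF.

    Raises FileNotFoundError when LibreOffice is not available, which is
    the situation described for cloud Databricks.
    """
    if shutil.which(executable) is None:
        raise FileNotFoundError(f"{executable} was not found on PATH")
    Path(out_dir).mkdir(parents=True, exist_ok=True)
    subprocess.run(build_convert_command(pptx_path, out_dir, executable),
                   check=True)
    # LibreOffice names the output after the input file's stem.
    return Path(out_dir) / (Path(pptx_path).stem + ".pdf")


# Inspect the command without running it (the output folder is illustrative):
print(build_convert_command("/dbfs/myfolder/BRK4024.pptx", "/tmp/slides"))
```

Building the command separately from running it makes it easy to check what would be executed before trying it on a cluster.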

Of course, you can also convert the pptx file to images on your local machine, upload them to cloud Databricks, and display them there. But doing all of this automatically, entirely on cloud Databricks, seems to be impossible.
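
For the display step, one possible sketch is to embed each uploaded image in an HTML img tag as a base64 data URI and hand it to displayHTML, which is available inside Databricks notebooks. The helper image_to_html is a hypothetical name, and the path in the usage comment is only an illustration of where an uploaded slide image might live:

```python
import base64


def image_to_html(image_bytes, mime_type="image/png"):
    """Wrap raw image bytes in an <img> tag using a base64 data URI."""
    encoded = base64.b64encode(image_bytes).decode("ascii")
    return f'<img src="data:{mime_type};base64,{encoded}" style="max-width:100%"/>'


# Usage inside a Databricks notebook (hypothetical path to an uploaded image):
#
#   with open('/dbfs/myfolder/slide-1.png', 'rb') as f:
#       displayHTML(image_to_html(f.read()))
```

Large images make the generated HTML correspondingly large, so this is better suited to a handful of slides than to a whole deck at full resolution.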

Peter Pan