I have a dataframe that contains a column of URL links, and I want each of the images displayed.

I tried the solution from "Spark using PySpark read images", which works for local files, but it didn't work for URL links.

If anyone knows how to accomplish this for a PySpark dataframe using a URL link, please do share.

Example of url jpg: https://steemitimages.com/DQmWSoXZPHH2XEuVRUbPqiPLf6niA2xfvFXYZ2FYPYhMQ4X/1%20(3).jpg

Maria Nazari
Hi, loading images only works for local or HDFS-like paths. You have to download the image to local disk first, then load it. – howie May 29 '19 at 00:48

1 Answer

Loading images only works for local or HDFS-like paths. You have to download the image to local disk first, then load it from there.


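import os
import urllib.request

# Local directory that will hold the downloaded images
sample_img_dir = "/tmp/images"
os.makedirs(sample_img_dir, exist_ok=True)

# Download the image from its URL to local disk
urllib.request.urlretrieve(
    "https://steemitimages.com/DQmWSoXZPHH2XEuVRUbPqiPLf6niA2xfvFXYZ2FYPYhMQ4X/1%20(3).jpg",
    os.path.join(sample_img_dir, "image1.jpg"),
)


# Read the image data using the "image" data source
# (assumes an existing SparkSession named `spark`)
image_df = (
    spark.read.format("image")
    .option("dropInvalid", True)
    .load(sample_img_dir)
)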


image_df.select("image.origin", "image.width", "image.height").show(truncate=False)
+-------------------------------------------+-----+------+
|origin                                     |width|height|
+-------------------------------------------+-----+------+
|file:///tmp/images/image1.jpg              |300  |311   |
|file:///tmp/images/image2.jpg              |199  |313   |
|file:///tmp/images/image3.jpg              |300  |200   |
|file:///tmp/images/image4.jpg              |300  |296   |
+-------------------------------------------+-----+------+
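
For a whole column of URLs, the same download-then-load idea could be sketched roughly as below. This is only a sketch under assumptions: `local_name_for` and `download_images` are hypothetical helper names, not part of Spark, and the download runs on the driver, so the list of URLs has to be small enough to collect.

```python
import hashlib
import os
import urllib.request
from urllib.parse import unquote, urlparse


def local_name_for(url):
    """Hypothetical helper: derive a stable local file name from a URL."""
    path = unquote(urlparse(url).path)
    # Keep the original extension if there is one; fall back to .jpg (assumption)
    ext = os.path.splitext(path)[1].lower() or ".jpg"
    # Hash the full URL so two URLs with the same base name do not collide
    digest = hashlib.sha1(url.encode("utf-8")).hexdigest()[:16]
    return digest + ext


def download_images(urls, target_dir):
    """Hypothetical helper: download each URL into target_dir.

    Returns a dict mapping each successfully downloaded URL to its local path.
    """
    os.makedirs(target_dir, exist_ok=True)
    saved = {}
    for url in urls:
        dest = os.path.join(target_dir, local_name_for(url))
        try:
            urllib.request.urlretrieve(url, dest)
        except (OSError, ValueError) as exc:
            # URLError/HTTPError are OSError subclasses; ValueError covers malformed URLs
            print(f"skipping {url}: {exc}")
            continue
        saved[url] = dest
    return saved
```

Usage, assuming the dataframe is called `df` and the URL column is called `url` (both names are assumptions): collect the URLs with `urls = [row["url"] for row in df.select("url").distinct().collect()]`, call `download_images(urls, "/tmp/images")`, and then load `/tmp/images` with `spark.read.format("image")` as shown above. On a multi-node cluster the files would need to sit on storage every executor can read, such as HDFS, rather than on the driver's local disk.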

howie