I am using Selenium to scrape a website. Some of the information is shown as images, and I would like to turn those images into text and then put the text into a CSV file. The page is http://www.58food.com/company_a1245289688.html

Can Selenium do that? Many thanks!

driver.find_element_by_xpath('//*[@class="contact-text"]/d1/img[1]')
driver.find_element_by_xpath('//*[@class="contact-text"]/d1/img[2]')
and then..?
Joyce
  • You need to pull the source image as a file and then encode into base64 to store in the CSV. Storing images in a CSV is a bad idea. The file will soon become enormous. – Vishnudev Krishnadas Feb 07 '21 at 02:45
  • As Vishnudev said, but to elaborate a little further: unless you compress the image (and most images are already compressed to save bandwidth and speed up loading), we can assume each image would be at least 1 MB. Converting it to plain text in a CSV means the CSV grows by at least that much for every image you store. Is there a reason you are trying to put the images in a CSV as plain text, or is this just experimental? – Oddity Feb 07 '21 at 02:58
  • @Oddity Thanks for your help. The images contain contact numbers, and I'd like to put those numbers in a CSV file for storage; I cannot think of a better way to do this. Is that possible? – Joyce Feb 07 '21 at 03:01
  • @Vishnudev thanks! But may I ask how could I encode them? – Joyce Feb 07 '21 at 03:02
  • Again, encoding is not good for large files. Instead, just extract the contact numbers from the images using OCR (optical character recognition) and put the numbers in the CSV. – Vishnudev Krishnadas Feb 07 '21 at 03:05
  • @Cathy you could use opencv to do OCR like in [this](https://www.geeksforgeeks.org/text-detection-and-extraction-using-opencv-and-ocr/) tutorial. – Oddity Feb 07 '21 at 03:06
  • Well, @Oddity OpenCV doesn't really do OCR, Tesseract does. – Vishnudev Krishnadas Feb 07 '21 at 03:07
  • @Vishnudev The tutorial I linked shows you an example using tesseract. Sorry for not adding that in my comment. – Oddity Feb 07 '21 at 03:12
  • Even after this, If you need the code for conversion of image to base64 https://stackoverflow.com/a/30280565/5120049 – Vishnudev Krishnadas Feb 07 '21 at 03:15
  • Thank you so much both, I am trying the links, thanks! – Joyce Feb 07 '21 at 03:18
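
The OCR route suggested in the comments above could be sketched roughly as follows. This is only an illustration, not a tested solution for this site: it assumes the third-party packages `pytesseract` and Pillow plus the Tesseract binary are installed, and the helper names `ocr_image`, `clean_number` and `numbers_to_csv` are made up for this example.

```python
import csv
import io
import re


def ocr_image(path: str) -> str:
    """Run Tesseract on an image file and return the raw recognized text.

    Assumption: pytesseract, Pillow and the Tesseract binary are installed.
    They are imported lazily so the helpers below work without them.
    """
    from PIL import Image
    import pytesseract

    return pytesseract.image_to_string(Image.open(path))


def clean_number(raw_text: str) -> str:
    """Keep only digits, '+' and '-' from the OCR output."""
    return re.sub(r"[^0-9+\-]", "", raw_text)


def numbers_to_csv(numbers, out) -> None:
    """Write one contact number per row to a CSV file-like object."""
    writer = csv.writer(out)
    writer.writerow(["contact_number"])
    for number in numbers:
        writer.writerow([number])


# Made-up OCR output, standing in for ocr_image("a.png")
raw = " 010-1234 5678\n"
buffer = io.StringIO()
numbers_to_csv([clean_number(raw)], buffer)
csv_text = buffer.getvalue()
```

In a real run, `raw` would come from `ocr_image(...)` on each downloaded image, and `buffer` would be a file opened with `open("contacts.csv", "w", newline="")`. This way the CSV holds only the short number strings rather than the images themselves.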

1 Answer

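import base64
import urllib.request

# Read the image URL from the <img> element's src attribute
src = driver.find_element_by_xpath('//*[@class="contact-text"]/d1/img[1]').get_attribute("src")

# Download the image to a temporary file; urlretrieve returns (filename, headers)
local_filename, headers = urllib.request.urlretrieve(src)

# Open the downloaded file in binary mode and base64-encode its bytes
with open(local_filename, "rb") as image_file:
    encoded_string = base64.b64encode(image_file.read())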

Retrieve the src attribute, use urllib to download the image, then open the downloaded file and convert its bytes to base64.
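
If you do want the base64 text in a CSV, a minimal sketch of that last step might look like this. The helper names `encode_image_bytes` and `write_rows` and the column names are assumptions for illustration, and the sample bytes are a placeholder rather than a real image.

```python
import base64
import csv
import io


def encode_image_bytes(image_bytes: bytes) -> str:
    """Return the base64 text form of raw image bytes."""
    return base64.b64encode(image_bytes).decode("ascii")


def write_rows(rows, out) -> None:
    """Write (label, base64_text) rows to a CSV file-like object."""
    writer = csv.writer(out)
    writer.writerow(["label", "image_base64"])
    writer.writerows(rows)


# Placeholder bytes; in the real script these would be the bytes read
# from the file that urlretrieve downloaded.
sample_bytes = b"\x89PNG\r\n\x1a\n" + b"placeholder"
encoded = encode_image_bytes(sample_bytes)

buffer = io.StringIO()
write_rows([("img1", encoded)], buffer)
csv_text = buffer.getvalue()

# Decoding the stored text gives back the original bytes
decoded = base64.b64decode(encoded)
```

Calling `.decode("ascii")` turns the `bytes` returned by `b64encode` into a plain string, so the CSV cell does not end up containing a `b'...'` literal.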

PDHide
  • Thanks for your help! but may I ask is the image_file the file I saved first at my local path (from the website)? So I should save the 'src' first and then do that? – Joyce Feb 07 '21 at 03:32
  • @cathy No, the local_filename variable already points to the downloaded content. If you want to save it under a name of your choice, you can call urllib.request.urlretrieve(src, "a.png") – PDHide Feb 07 '21 at 03:34
  • This will store the img as a.png – PDHide Feb 07 '21 at 03:34
  • But may I ask what `image_file` refers to? It says undefined name. – Joyce Feb 07 '21 at 03:36