
I have a bunch of scripts that do web scraping, download files, and read them with pandas. This process has to be deployed on a new architecture where downloading the files to disk is not appropriate; instead, it is preferable to save the file in memory and read it with pandas from there. For demonstration purposes, here is a web scraping script that downloads an Excel file from a random website:

import time
import pandas as pd
from io import StringIO, BytesIO
from selenium import webdriver
from selenium.webdriver.common.action_chains import ActionChains
from datetime import date, timedelta
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.support.ui import Select
from selenium.webdriver.support.wait import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By


pathDriver = '/path/to/chromedriver'  # path to your chromedriver binary

driver = webdriver.Chrome(executable_path=pathDriver)

url = 'https://file-examples.com/index.php/sample-documents-download/sample-xls-download/'

driver.get(url)
time.sleep(1)

file_link = driver.find_element_by_xpath('//*[@id="table-files"]/tbody/tr[1]/td[5]/a[1]')
file_link.click()

This script effectively downloads the file to my Downloads folder. What I've tried is to create a StringIO() or BytesIO() stream before and after the click() method and then read the object, similar to this:

file_object = StringIO()
df = pd.read_excel(file_object.read())

But file_object doesn't capture anything, and the file is still downloaded to my disk.

Any suggestions?
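For context on why the StringIO attempt above cannot work: the download happens inside the browser process, so a StringIO created in the Python process never sees those bytes. A hedged sketch of one possible direction, assuming Chrome and Selenium 4 (where `execute_cdp_cmd` is available), is to redirect the browser's downloads to a known temp directory via the DevTools protocol `Page.setDownloadBehavior` command, then read the finished file into a BytesIO:

```python
from io import BytesIO
import tempfile

def download_behavior_params(download_path):
    # Payload for the CDP command Page.setDownloadBehavior
    return {"behavior": "allow", "downloadPath": download_path}

tmp_dir = tempfile.mkdtemp()
params = download_behavior_params(tmp_dir)

# With a live Selenium 4 Chrome driver this would look like:
#   driver.execute_cdp_cmd("Page.setDownloadBehavior", params)
#   file_link.click()
#   # ...wait until the file appears in tmp_dir, then:
#   with open(downloaded_path, "rb") as f:   # downloaded_path: hypothetical
#       df = pd.read_excel(BytesIO(f.read()))
```

The file still touches a filesystem, but it is a throwaway temp directory you control rather than the user's Downloads folder, and the bytes can be read once into memory and discarded.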

David López
    Instead of `file_link.click()` can you find what is the `href` of the link? That way you can do `df = pd.read_excel(href)` or download it to a temp folder / memory with `requests` – Code Different Oct 31 '20 at 13:18
  • Hi @CodeDifferent, unfortunately the "href" attribute is not present in all HTML elements in my scraper's scripts, so it is not a scalable solution. – David López Nov 03 '20 at 14:23
    @DavidLópez please provide another use case where the "href" attribute isn't present. – Life is complex Nov 04 '20 at 17:30
    either create a ramdisk or use a script injection to intercept the request/response. – Florent B. Nov 04 '20 at 20:15
    Scraping is the word that better fits the context of this question. Note that scrapping and scraping have significantly [different meaning](https://stackoverflow.com/a/42842376/14349691). Would have edited the question myself, but apparently there is a requirement on edits having to change at least 6 characters. – Attila Viniczai Nov 09 '20 at 23:37

2 Answers


What you're asking can be accomplished by adding a Selenium add_experimental_option. I also redesigned your code to loop through the table, extract the hrefs, and pass them to StringIO. No files are downloaded to my local system with this code.

If I missed something please let me know.

import pandas as pd
from time import sleep
from io import StringIO
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.desired_capabilities import DesiredCapabilities

capabilities = DesiredCapabilities().CHROME

chrome_options = Options()
chrome_options.add_argument("--incognito")
chrome_options.add_argument("--disable-infobars")
chrome_options.add_argument("start-maximized")
chrome_options.add_argument("--disable-extensions")
chrome_options.add_argument("--disable-popup-blocking")

prefs = {
    'profile.default_content_setting_values': {
        'automatic_downloads': 0
    },
    'profile.content_settings.exceptions': {
        'automatic_downloads': 0
    }
}

chrome_options.add_experimental_option('prefs', prefs)
capabilities.update(chrome_options.to_capabilities())

driver = webdriver.Chrome('/usr/local/bin/chromedriver', options=chrome_options)

url_main = 'https://file-examples.com/index.php/sample-documents-download/sample-xls-download/'

driver.get(url_main)

elements = driver.find_elements_by_xpath('//*[@id="table-files"]//td/a')
for element in elements:
    if str(element.get_attribute("href")).endswith('.xls'):
        file_object = StringIO(element.get_attribute("href"))
        xls_file = file_object.read()
        df = pd.read_excel(xls_file)
        print(df.to_string(index=False))

sleep(360)

Output:

  First Name  Last Name  Gender        Country  Age        Date    Id
1      Dulce      Abril  Female  United States   32  15/10/2017  1562
2       Mara  Hashimoto  Female  Great Britain   25  16/08/2016  1582
3     Philip       Gent    Male         France   36  21/05/2015  2587
4   Kathleen     Hanner  Female  United States   25  15/10/2017  3549
5    Nereida    Magwood  Female  United States   58  16/08/2016  2468
6     Gaston      Brumm    Male  United States   24  21/05/2015  2554
7       Etta       Hurn  Female  Great Britain   56  15/10/2017  3598
8    Earlean     Melgar  Female  United States   27  16/08/2016  2456
9   Vincenza    Weiland  Female  United States   40  21/05/2015  6548

Here is an example using a RAMDISK, which was mentioned in the comments. This option uses neither Selenium's add_experimental_option nor StringIO.

import fs
import pandas as pd
from time import sleep
from selenium import webdriver
from selenium.webdriver.chrome.options import Options

chrome_options = Options()
chrome_options.add_argument("--incognito")
chrome_options.add_argument("--disable-infobars")
chrome_options.add_argument("start-maximized")
chrome_options.add_argument("--disable-extensions")
chrome_options.add_argument("--disable-popup-blocking")

driver = webdriver.Chrome('/usr/local/bin/chromedriver', options=chrome_options)

url_main = 'https://file-examples.com/index.php/sample-documents-download/sample-xls-download/'

driver.get(url_main)

urls_to_process = []
elements = driver.find_elements_by_xpath('//*[@id="table-files"]//td/a')

# Create RAMDISK
mem_fs = fs.open_fs('mem://')
mem_fs.makedir('hidden_dir')

for element in elements:
    if str(element.get_attribute("href")).endswith('.xls'):
        with mem_fs.open('hidden_dir/file1.csv', 'w') as in_file:
            in_file.write(element.get_attribute("href"))
        with mem_fs.open('hidden_dir/file1.csv', 'r') as out_file:
            df = pd.read_excel(out_file.read())
            print(df.to_string(index=False))
            # same output as above
        sleep(360)
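One caveat on the RAMDISK approach: pyfilesystem's `mem://` filesystem lives inside the Python process, so Chrome itself cannot save downloads into it. If you want the browser to deposit the file somewhere you control, a minimal sketch (assuming Chrome honors the standard `download.default_directory` preference) is to point it at a real temporary directory and read the file from there:

```python
import os
import tempfile

# Assumption: Chrome honors the download.default_directory preference.
download_dir = tempfile.mkdtemp()
prefs = {
    "download.default_directory": download_dir,
    "download.prompt_for_download": False,
}

# With selenium this would be wired up as:
#   chrome_options.add_experimental_option("prefs", prefs)
#   driver = webdriver.Chrome(options=chrome_options)
# After clicking the link, the file lands in download_dir,
# from where it can be read once into memory and discarded.
```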
Life is complex
  • Hi @Lifeiscomplex. The solution you provide works fine for the example I gave, but unfortunately, as I explained above, I have to scrape a lot of sites, and most of them don't have an "href" attribute on the button that downloads the desired file, so this solution doesn't scale well. I'm searching for a way to obtain the URL or endpoint that the button activates. It's as if the "href" were hidden; if I could find it, I could apply the approach you suggested to all my scrapers: file_object = StringIO(my_desired_url) – David López Nov 03 '20 at 21:23
  • @DavidLópez You need to provide other use cases, because I can only develop an answer from the question and its data elements. Without these additional data elements I don't know how my answer scales across multiple websites. – Life is complex Nov 03 '20 at 22:08
  • can you create this RAMDISK and then set that as your download directory? Using download.default_directory? – DMart Nov 05 '20 at 02:02
    @DMart that is what is happening in the RAMDISK answer. I'm creating a disk in memory and writing and reading from the disk. – Life is complex Nov 05 '20 at 02:05
  • sorry, i didn't see in your code where you set the download directory. – DMart Nov 05 '20 at 13:18
  • @DMart here is the documentation on the module - https://docs.pyfilesystem.org/en/latest – Life is complex Nov 05 '20 at 13:40
  • @Lifeiscomplex is there a way to, for example, download the file and save in this RAMDISK? I'm am thinking in create the RAMDISK and tell chrome driver to save the file in this location – David López Nov 18 '20 at 21:39
  • Yes, you can do this with a few extra lines of code. Do you need me to update the answer? – Life is complex Nov 18 '20 at 22:07
  • @Lifeiscomplex that would be great! – David López Nov 19 '20 at 00:24
  • @DavidLópez this answer that I wrote has the details on using RAMDISK, https://stackoverflow.com/questions/64638010/compare-csv-files-content-with-filecmp-and-ignore-metadata/64763745#64763745 – Life is complex Nov 21 '20 at 13:31
  • @Lifeiscomplex will appreciate if you check on this question that has to download a file without a file link. [ https://stackoverflow.com/questions/71080967/python-selenium-to-download-file-to-memory ] – Tiger Strom Feb 11 '22 at 13:40

IMO, using selenium is clearly unnecessary. requests + beautifulsoup + pandas alone is enough (this is much faster than selenium and needs less code).

Code below:

from io import BytesIO
import requests
from bs4 import BeautifulSoup
import pandas as pd

response = requests.get("https://file-examples.com/index.php/sample-documents-download/sample-xls-download/")
soup = BeautifulSoup(response.text, "html.parser")
# get the download link
file_link = soup.select_one(".file-link > a").get("href")

# download it in memory
bytes_of_file = requests.get(file_link).content
df = pd.read_excel(BytesIO(bytes_of_file))
print(df)

Result:

   0 First Name  Last Name  Gender        Country  Age        Date    Id
0  1      Dulce      Abril  Female  United States   32  15/10/2017  1562
1  2       Mara  Hashimoto  Female  Great Britain   25  16/08/2016  1582
2  3     Philip       Gent    Male         France   36  21/05/2015  2587
3  4   Kathleen     Hanner  Female  United States   25  15/10/2017  3549
4  5    Nereida    Magwood  Female  United States   58  16/08/2016  2468
5  6     Gaston      Brumm    Male  United States   24  21/05/2015  2554
6  7       Etta       Hurn  Female  Great Britain   56  15/10/2017  3598
7  8    Earlean     Melgar  Female  United States   27  16/08/2016  2456
8  9   Vincenza    Weiland  Female  United States   40  21/05/2015  6548

And this doesn't download any Excel files to disk.
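One caveat worth adding to this approach: some pages emit relative `href`s, which need to be resolved against the page URL before `requests.get` can fetch them. A standard-library sketch, using a hypothetical relative link for illustration:

```python
from urllib.parse import urljoin

page_url = "https://file-examples.com/index.php/sample-documents-download/sample-xls-download/"
relative_href = "/storage/file_example_XLS_10.xls"  # hypothetical relative link
file_link = urljoin(page_url, relative_href)
print(file_link)  # https://file-examples.com/storage/file_example_XLS_10.xls
```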

jizhihaoSAMA
    Except if you are trying to download something behind services like Cloudflare, which blocks python scripts, curl, nodeJS, and many more. – Dzeri96 Jan 07 '22 at 20:54
  • There are just some legacy websites that don't have any link to any excel file and the only way to download the file is by clicking the button. – Mark Jun 24 '22 at 19:03