I have a bunch of scripts that do web scraping: they download files and read them with pandas. This process has to be deployed in a new architecture where downloading the files to disk is not appropriate; instead, it is preferable to keep each file in memory and read it with pandas from there. For demonstration purposes, here is a web scraping script that downloads an Excel file from a random website:
import time
import pandas as pd
from io import StringIO, BytesIO
from selenium import webdriver

pathDriver = '/path/to/chromedriver'  # path to the chromedriver executable
driver = webdriver.Chrome(executable_path=pathDriver)
url = 'https://file-examples.com/index.php/sample-documents-download/sample-xls-download/'
driver.get(url)
time.sleep(1)
file_link = driver.find_element_by_xpath('//*[@id="table-files"]/tbody/tr[1]/td[5]/a[1]')
file_link.click()
This script effectively downloads the file to my Downloads folder. What I've tried is to create a StringIO() or BytesIO() stream before and after the click() call and then read the object, similar to this:
file_object = StringIO()
df = pd.read_excel(file_object.read())
But file_object never captures the file, and the file still ends up downloaded to my disk anyway.
Any suggestions?
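For what it's worth, I have verified that pandas can read an Excel file from an in-memory buffer once the raw bytes are in hand; here is a minimal, self-contained check (the small DataFrame is just a stand-in for the real downloaded content):

```python
from io import BytesIO
import pandas as pd

# Build a small workbook in memory to stand in for the downloaded bytes
# (in the real flow these would come from the download itself)
buf = BytesIO()
pd.DataFrame({'a': [1, 2], 'b': [3, 4]}).to_excel(buf, index=False)

# Rewind the buffer and read it back with pandas; the disk is never touched
buf.seek(0)
df = pd.read_excel(buf)
```

So what I'm really missing is how to get the downloaded bytes into such a buffer without Chrome writing the file to disk first.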