I am facing some difficulties inserting binary values (bytea) into a PostgreSQL DB using psycopg2. I am new to Postgres, so pardon my ignorance, please.
So, here's the scenario:
- I am scraping my university website (using BeautifulSoup) and retrieving the notifications posted in the Notification section.
- It's a shitty site, so each notification is posted as plain text on a webpage, with the notification text hyperlinked to a pdf document.
- There are 7 sub-pages that I scrape. I know I am not using the right web-dev terminology, so here's what I mean: the root URL is https://exampleuniversity.com/ and the sub-pages are 'https://exampleuniversity.com/univ-circulars/', 'https://exampleuniversity.com/univ-notification/', etc. - there are 7 of those.
- These 7 sub-pages are placed in a list (called paths). I iterate over each of them (appending it to the root URL) and scrape the text of the notification, the URL, and the pdf file itself. This all works like a charm. The issue is when I need to write them into the (docker) PostgreSQL table.

FYI - the docker PostgreSQL container by itself works fine.

So, as you can imagine, each iteration over the list of sub-pages gives me a batch of scraped texts + URLs + pdf files that I need to write into the DB.
Following https://www.postgresqltutorial.com/postgresql-python/blob/, I learnt that I have to read the file with open(file, 'rb').read(), and wrap the result in psycopg2.Binary(<variable_name>) when inserting it.
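For context, here is a minimal sketch of that single-row approach. The table and column names are made up, and the actual cur.execute() call is only shown in a comment since it needs a live connection; the point is just that reading with 'rb' gives you bytes, which psycopg2 can adapt to bytea:

```python
import tempfile, os

# Create a stand-in for a downloaded pdf (dummy content).
with tempfile.NamedTemporaryFile(suffix='.pdf', delete=False) as f:
    f.write(b'%PDF-1.4 dummy')
    pdf_path = f.name

file_data = open(pdf_path, 'rb').read()   # bytes, ready for a bytea column
assert isinstance(file_data, bytes)

# With a real connection you would then run (hypothetical table/column names):
#   cur.execute(
#       "INSERT INTO notifications (pdf_file) VALUES (%s)",
#       (psycopg2.Binary(file_data),),
#   )
# In Python 3, plain bytes are adapted to bytea as well, so passing
# file_data directly also works.
sql = "INSERT INTO notifications (pdf_file) VALUES (%s)"
params = (file_data,)

os.remove(pdf_path)  # clean up the temp file
```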
All right. But this has to be done per iteration, and each scraped field lives in its own list (texts is one column, urls is another, etc.). So, following psycopg2: insert multiple rows with one query (answer by ant32), I learnt that I have to build args_str = ','.join(cur.mogrify("(%s,%s,%s,%s,%s,%s,%s,%s,%s)", x) for x in tup) and then run cur.execute("INSERT INTO table VALUES " + args_str) - please assume I have the variables and the placeholders placed correctly.
So, I zip the retrieved texts + URLs + open(file, 'rb').read() results (per iteration) into a tuple of tuples and try to follow the above example, and (naturally) it fails. (FYI: every item in the tuple is now a group of elements which forms one row to be inserted into the DB - as opposed to the per-column lists discussed earlier.) And that's the issue I am having.
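To illustrate the rows-vs-columns transposition I mean (with made-up data standing in for the scraped values):

```python
# Per-column lists, as scraped (dummy values):
texts = ['Exam schedule', 'Fee notice']
urls = ['https://example.com/a.pdf', 'https://example.com/b.pdf']
pdf_obj = [b'%PDF-a', b'%PDF-b']
url_hash = ['hash-a', 'hash-b']

# zip transposes them into per-row tuples, one tuple per DB row:
data_to_insert = tuple(zip(texts, urls, pdf_obj, url_hash))

assert len(data_to_insert) == 2   # two rows
assert data_to_insert[0] == (
    'Exam schedule', 'https://example.com/a.pdf', b'%PDF-a', 'hash-a'
)
```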
The error thrown is: sequence item 0: expected str instance, bytes found
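That error can be reproduced without any database at all: in Python 3, cur.mogrify() returns bytes, and a str separator refuses to join bytes. A minimal repro, plus the bytes-separator variant that does join (shown only to isolate the error, not as the full fix):

```python
# What cur.mogrify(...) returns in Python 3: bytes, not str.
mogrified = [b"('a',1)", b"('b',2)"]

# Joining with a str separator fails exactly like in the question:
try:
    ','.join(mogrified)
except TypeError as e:
    error_message = str(e)

assert 'expected str instance, bytes found' in error_message

# Joining with a bytes separator works:
args_bytes = b','.join(mogrified)
assert args_bytes == b"('a',1),('b',2)"
```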
In the code below, some comments have been marked with "TO STACKOVERFLOW". They explain a bit more of what I was thinking of doing.
Here's the code:
for path in paths:
    for i in range(retry_attempts):
        try:
            # Creating directory to save the pdf files
            dir_to_create = f'./app/{today}/{path}'
            if not os.path.exists(dir_to_create):
                os.makedirs(dir_to_create)
            print(f'\n****** Scraping - {path} ******'.upper())
            univ_notification_url = univ_url + path
            univ_notification_page = requests.get(univ_notification_url)
            soup = BeautifulSoup(univ_notification_page.content, 'html.parser')
            items = soup.find_all('div', {'class': 'entry'})
            desired_p_tags = items[0].find_all('p')
            # Removing <p> tags with no href (only with the string values as mentioned below)
            for _ in desired_p_tags:
                if _.text == 'University Circulars' or 'Circulars, Notifications, Letters' in _.text:
                    desired_p_tags.remove(_)
            # Only retaining <p> tags that point to a pdf file (so the ones linking to another webpage are omitted)
            desired_p_tags = [_ for _ in desired_p_tags if '.pdf' in (_.find('a'))['href']]
            texts = [_.text for _ in desired_p_tags]
            urls = [(_.find('a'))['href'] for _ in desired_p_tags]
            url_hash = []  # TO STACKOVERFLOW: I am also creating a blake2b hash for the files that I download and save it in the db. So the zipped tuple will contain this too
            pdf_obj = []
            for url in urls:
                filename = os.path.basename(url)  # To get the name of the downloaded file
                urllib.request.urlretrieve(url, f'{dir_to_create}/{filename}')  # Downloading the file
                pdf_obj.append(open(f'{dir_to_create}/{filename}', 'rb').read())  # Appending into the list the binary form of the downloaded file
                url_hash.append(hashlib.blake2b(open(f'{dir_to_create}/{filename}', 'rb').read()).hexdigest())  # Hashing the file
                print(f'Downloaded {filename}\n')
            data_to_insert = tuple(zip(texts, urls, pdf_obj, url_hash))  # Zipping the variables into a tuple of rows
            # DATABASE PART
            conn = psycopg2.connect(
                host='pg_db',
                port='5432',
                database='bu_notifications',
                user='superuser',
                password='User1234!'
            )
            cur = conn.cursor()
            args_str = ','.join(cur.mogrify("(%s,%s,%s,%s)", x) for x in data_to_insert)
            cur.execute("INSERT INTO table00 (notification_text, item_url, pdf_file, pdf_hash) VALUES " + args_str)
            conn.commit()
            # Close connection and cursor (also not writing down the except clause that follows after the try)
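For reference, the hashing step above works fine on its own (dummy bytes standing in for a downloaded pdf here):

```python
import hashlib

pdf_bytes = b'%PDF-1.4 dummy'                  # stand-in for open(file, 'rb').read()
digest = hashlib.blake2b(pdf_bytes).hexdigest()

# blake2b's default digest size is 64 bytes -> 128 hex characters
assert len(digest) == 128
```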