I am facing some difficulties inserting binary values (bytea) into a PostgreSQL DB using psycopg2. I am new to Postgres, so pardon my ignorance, please.

So, here's the scenario:

  1. I am scraping my university website (using BeautifulSoup) and retrieving the notifications they post in the Notification section.
  2. It's a poorly built site, so each notification is just posted on a webpage, with the notification text hyperlinked to a pdf document.
  3. There are 7 sub-pages that I scrape. I know I am not using the right web-dev terminology, so here's what I mean: the root URL is https://exampleuniversity.com/ and the sub-pages are 'https://exampleuniversity.com/univ-circulars/', 'https://exampleuniversity.com/univ-notification/' etc. - there are 7 of those.
  4. These 7 sub-pages are placed in a list (called paths) and I iterate over each of them (appending it to the root URL) and scrape the text of the notification, the URL and the pdf file itself. This all works like a charm. The issue is when I need to write them into the (docker) PostgreSQL table.

FYI - the docker PostgreSQL container part by itself works fine.

So, as you can imagine, when I iterate over the list of sub-pages I get a batch of (scraped) texts + URLs + pdf files that I need to write into the DB.

Following https://www.postgresqltutorial.com/postgresql-python/blob/, I learnt that I have to read the file with open(file, 'rb').read() and wrap it in psycopg2.Binary(<variable_name>) when inserting it.
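
For a single file, that pattern looks roughly like this (a minimal sketch - the table and column names here are placeholders, not my real schema):

    import psycopg2

    conn = psycopg2.connect(host='localhost', database='testdb', user='user', password='pass')
    cur = conn.cursor()

    file_bytes = open('notification.pdf', 'rb').read()  # Read the file as bytes
    cur.execute(
        "INSERT INTO documents (pdf_file) VALUES (%s)",
        (psycopg2.Binary(file_bytes),)  # Binary() adapts the bytes for a bytea column
    )
    conn.commit()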

All right. But this has to be done per iteration. And there's a list per column (like, texts is a column, urls is a column etc.). So, following psycopg2: insert multiple rows with one query (answer by ant32), I learnt the pattern below - please assume I have the variables and the placeholders placed correctly:

    args_str = ','.join(cur.mogrify("(%s,%s,%s,%s,%s,%s,%s,%s,%s)", x) for x in tup)
    cur.execute("INSERT INTO table VALUES " + args_str)

So, per iteration, I zip the retrieved texts + URLs + open('file', 'rb').read() results and make a tuple, then try to follow the above example, and (naturally) it fails (FYI: every item in the tuple is now a group of elements - text, URL, file bytes and, as noted in the code, a hash - which forms the row that needs to be inserted into the DB, as opposed to the columns that were discussed earlier). And that's the issue I am having.

The error thrown is: sequence item 0: expected str instance, bytes found
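
I suspect this is because cur.mogrify() returns bytes under Python 3, so a str ',' refuses to join the results - it's reproducible in plain Python:

    # str.join() over a sequence of bytes raises exactly the error above:
    ','.join([b"('text','url')"])
    # TypeError: sequence item 0: expected str instance, bytes found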

In the code below, some comments have been marked with "TO STACKOVERFLOW". Those explain a bit more of what I was thinking of doing.

Here's the code:

for path in paths:
    for i in range(retry_attempts):
        try:
            #Creating directory to save the pdf files
            dir_to_create = f'./app/{today}/{path}'
            if not os.path.exists(dir_to_create):
                os.makedirs(dir_to_create)

            print(f'\n****** Scraping - {path} ******'.upper())
            univ_notification_url = univ_url + path
            univ_notification_page = requests.get(univ_notification_url)
            soup = BeautifulSoup(univ_notification_page.content, 'html.parser')
            items = soup.find_all('div', {'class': 'entry'})
            desired_p_tags = items[0].find_all('p')

            #Removing <p> tags with no href (the ones only containing the string values mentioned below);
            #filtering into a new list instead of remove()-ing while iterating, which would skip elements
            desired_p_tags = [_ for _ in desired_p_tags
                              if _.text != 'University Circulars'
                              and 'Circulars, Notifications, Letters' not in _.text]

            #Only retaining <p> tags that point to a pdf file (so the ones linking to another webpage are omitted)
            desired_p_tags = [_ for _ in desired_p_tags if '.pdf' in (_.find('a'))['href']]

            texts = [_.text for _ in desired_p_tags]
            urls = [(_.find('a'))['href'] for _ in desired_p_tags]

            url_hash = [] #TO STACKOVERFLOW: I am also creating a blake2b hash for the files that I download and save it in the db. So the zipped tuple will contain this too
            pdf_obj = []
            for url in urls:
                filename = os.path.basename(url) # To get the name of the downloaded file 
                urllib.request.urlretrieve(url, f'{dir_to_create}/{filename}') # Downloading the file
                file_bytes = open(f'{dir_to_create}/{filename}', 'rb').read() # Reading the downloaded file once as bytes
                pdf_obj.append(file_bytes) # Appending the binary content for the bytea column
                url_hash.append(hashlib.blake2b(file_bytes).hexdigest()) # Hashing the file
                print(f"Downloaded {filename}\n")
                
            data_to_insert = tuple(zip(texts, urls, pdf_obj, url_hash)) #Zipping the variables into a tuple

            # DATABASE PART
            conn = psycopg2.connect(
                host = 'pg_db',
                port = '5432',
                database = 'bu_notifications', 
                user = 'superuser', 
                password = 'User1234!'
            )

            cur = conn.cursor()

            args_str = ','.join(cur.mogrify("(%s,%s,%s,%s)", x) for x in data_to_insert)
            cur.execute("INSERT INTO table00 (notification_text, item_url, pdf_file, pdf_hash) VALUES (%s, %s, %s, %s)" + args_str)
            conn.commit()
            #Close connection and cursor (also not writing down the except clause that follows the try)

1 Answer

OK. I managed to fix it. This is how I did it:

  1. I created a list that held the psycopg2.Binary(open(f'{dir_to_create}/{filename}', 'rb').read()) values themselves, along with the already existing URL and text lists, then created the tuple out of the three (a sketch follows after the code below).

  2. Then I iterated over the tuple and inserted each element into the table like this (mind you - there's an additional column):

             for _ in data_to_insert:
                 cur.execute("INSERT INTO table00 (notification_text, item_url, pdf_file, pdf_hash) VALUES (%s, %s, %s, %s)", _)
             conn.commit()
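
A minimal sketch of step 1, reusing the variable names from the question (texts, urls and dir_to_create are assumed to already exist as above):

    import hashlib
    import os
    import urllib.request

    import psycopg2

    url_hash = []
    pdf_obj = []
    for url in urls:
        filename = os.path.basename(url)
        urllib.request.urlretrieve(url, f'{dir_to_create}/{filename}')  # Download the pdf
        file_bytes = open(f'{dir_to_create}/{filename}', 'rb').read()   # Read it back as bytes
        pdf_obj.append(psycopg2.Binary(file_bytes))                     # Wrapped for the bytea column
        url_hash.append(hashlib.blake2b(file_bytes).hexdigest())        # blake2b hash of the file

    data_to_insert = tuple(zip(texts, urls, pdf_obj, url_hash))

Since each row is now a plain tuple of adaptable values, the parametrised execute() in step 2 handles the binary data without any mogrify/join juggling.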

It now writes into the db (at least I can see it there in pgAdmin). Now I need to find a way to download it from there locally, but that's for another Stack Overflow post.
