0

I have a problem with attempting to pipeline some entries into a Postgresql database. The loader is in this file movie_loader.py provided to me:

import csv

"""
This program generates direct SQL statements from the source Netflix Prize files in order
to populate a relational database with those files’ data.

By taking the approach of emitting SQL statements directly, we bypass the need to import
some kind of database library for the loading process, instead passing the statements
directly into a database command line utility such as `psql`.
"""

# The INSERT approach is best used with a transaction. An introductory definition:
# instead of “saving” (committing) after every statement, a transaction waits on a
# commit until we issue the `COMMIT` command.
print('BEGIN;')

# For simplicity, we assume that the program runs where the files are located.
MOVIE_SOURCE = 'movie_titles.csv'
with open(MOVIE_SOURCE, 'r+', encoding='iso-8859-1') as f:
    reader = csv.reader(f)
    for row in reader:
        id = row[0]
        year = 'null' if row[1] == 'NULL' else int(row[1])
        title = ', '.join(row[2:])

        # Watch out---titles might have apostrophes!
        title = title.replace("'", "''")
        print(f'INSERT INTO movie VALUES({id}, {year}, \'{title}\');')
        sys.stdout.reconfigure(encoding='UTF08')

# We wrap up by emitting an SQL statement that will update the database’s movie ID
# counter based on the largest one that has been loaded so far.
print('SELECT setval(\'movie_id_seq\', (SELECT MAX(id) from movie));')

# _Now_ we can commit our transation.
print('COMMIT;')

However, when attempting to pipeline this file into my database, I get the following error, which seems to be some kind of encoder error. I am using git bash as my terminal.

$ python3 movie_loader.py | psql postgresql://localhost/postgres
stdin is not a tty
Traceback (most recent call last):
  File "C:\Users\dhuan\relational\movie_loader.py", line 28, in <module>
    print(f'INSERT INTO movie VALUES({id}, {year}, \'{title}\');')
OSError: [Errno 22] Invalid argument
Exception ignored in: <_io.TextIOWrapper name='<stdout>' mode='w' encoding='cp1252'>
OSError: [Errno 22] Invalid argument

It seems as if maybe my dataset has an error? I'm not sure specifically what the error is pointing at. Any insight is appreciated

  • I don't know why you used the [tag:git] tag, as Git doesn't appear anywhere in any of this. It's a simple text encoding issue with Python: your output encoding is `cp1252` and your input has some character(s) that cannot be represented in this encoding. – torek Oct 17 '22 at 00:06
  • sorry about the git tag, i clicked it just because i was using git bash as a terminal. For the output part - how would I be able to convert the encoding 'cp1252' into utf-8 or 'iso-8859-1' ? – database student Oct 17 '22 at 00:09
  • Python's `sys.stdout` encoding is based on your environment, but can now be changed if you like: see [How to set sys.stdout encoding in Python 3?](https://stackoverflow.com/q/4374455/1256452) – torek Oct 17 '22 at 00:11
  • I'm trying to use ```sys.stdout.reconfigure(encoding='UTF-8)``` and it's still giving me an error, i've edited the main to show it. – database student Oct 17 '22 at 02:46
  • Your call to `sys.stdout.reconfigure(encoding='UTF08')` is too late. Do it very early on in the script. (Also, check your Python3 version to make sure you have 3.7 or later; many systems still ship ancient Python3.) – torek Oct 17 '22 at 04:27

0 Answers0