1

I'm trying to get some data into a postgreSQL database using Psycopg2. The function I'm using to load the database is as follows:

def load_db():
    data = clean_data()

    conn = psycopg2.connect(database='database', user='user')
    cur = conn.cursor()

    for d in data:
        publisher_id = (d[5]['publisher_id'])
        publisher = (d[4]['publisher'])

        cur.execute("INSERT INTO publisher (id, news_org) SELECT (%s,%s) WHERE NOT EXISTS (SELECT id FROM publisher WHERE id = %s);",
           (publisher_id, publisher))

    conn.commit()
    cur.close()
    conn.close()

But I am getting the error IndexError: tuple index out of range and not really sure what I'm doing wrong. In the records I'm trying to input there are a lot of repeated publisher_id's and publisher's, hence the WHERE NOT EXISTS. I'm pretty new to working with databases through python so I'm sure it's some thing simple. Thanks in advance!

UPDATE!

A sample of the data is as follows:

 [{'article_id': 7676933011},
  {'web_id': u'world/2015/jul/03/iranian-foreign-minister-raises-prospect-of-joint-action-against-islamic-state'},
  {'title': u'Iranian foreign minister raises prospect of joint action against Islamic State'},
  {'pub_date': u'2015-07-03T21:30:51Z'},
  {'publisher': 'The Guardian'},
  {'publisher_id': '1'},
  {'author': u'Julian Borger'},
  {'author_id': u'15924'},
  {'city_info': [{'city_name': u'Vienna',
                  'country_code': u'US',
                  'id': 4791160,
                  'lat': 38.90122,
                  'lon': -77.26526}]},
  {'country_info': [{'country_code': u'IR',
                     'country_name': u'Islamic Republic of Iran',
                     'lat': 32.0,
                     'lon': 53.0},
                    {'country_code': u'US',
                     'country_name': u'United States',
                     'lat': 39.76,
                     'lon': -98.5}]},
  {'org_info': [{'organization': u'Republican'},
                {'organization': u'US Congress'},
                {'organization': u'Palais Coburg Hotel'},
                {'organization': u'Islamic State'},
                {'organization': u'United'}]},
  {'people_info': [{'people': u'Mohammad Javad Zarif'},
                   {'people': u'John Kerry'}]}]

The full traceback is:

Traceback (most recent call last):
  File "/Users/Desktop/process_text/LoadDB.py", line 69, in <module>
    load_db()
  File "/Users/Desktop/process_text/LoadDB.py", line 50, in load_db
    (publisher_id, publisher))
IndexError: tuple index out of range
sammy88888888
  • 458
  • 1
  • 5
  • 18

1 Answers1

4

The issue is in your cur.execute() line -

cur.execute("INSERT INTO publisher (id, news_org) SELECT (%s,%s) WHERE NOT EXISTS (SELECT id FROM publisher WHERE id = %s);",
       (publisher_id, publisher))

As you can see above you are using three %s - ...SELECT (%s,%s)...WHERE id = %s); , but you are only providing value for two (two values in the tuple) .

When cur.execute internally tries to find the third value, it causes the index issue.

I am not sure what values would be correct there, but you need to either change that to 2 %s , or provide a third value in the tuple - (publisher_id, publisher) .

Anand S Kumar
  • 88,551
  • 18
  • 188
  • 176
  • Yeah that makes sense. basically i'm trying to ensure that once `publisher_id` and `publisher` are inserted, it won't break when it's tried to be inserted again. I've changed the statement to: `("INSERT INTO publisher (id, news_org) SELECT (%s,%s) WHERE NOT EXISTS (SELECT id FROM publisher);", (publisher_id, publisher))` but now get the error: `ProgrammingError: INSERT has more target columns than expressions`. – sammy88888888 Jul 04 '15 at 14:17
  • I am not sure about this, but I dont think you are doing this correctly, `SELECT (%s,%s)` the items after select are actually column names not data, and I think those column names should be static for you, why do you need to pass them as %s ? – Anand S Kumar Jul 04 '15 at 14:20
  • @sammy: There are three placeholders `%s`, so the argument you pass needs 3 elements. Try `(publisher_id, publisher, publisher_id)`. Also be aware that this statement can fail when there are concurrent writes: http://stackoverflow.com/a/13342031/190597. – unutbu Jul 04 '15 at 14:24
  • great, that seems to have done the trick, simply a placeholder issue. like i say, new to this so all a learning curb. thanks for all your help! – sammy88888888 Jul 04 '15 at 14:41