
I've written a Python 2.7 scraper, but I'm getting an error when attempting to save my data. The scraper runs on ScraperWiki, but I think that's largely irrelevant to the error: saving in ScraperWiki appears to be handled by SQLAlchemy, and it's this that is raising the error.

I get this error message:

Traceback (most recent call last):
  File "./code/scraper", line 192, in <module>
    saving(spreadsheet_pass)
  File "./code/scraper", line 165, in saving
    scraperwiki.sql.save(["URN"], school, "magic")
  File "/usr/local/lib/python2.7/dist-packages/scraperwiki/sql.py", line 195, in save
    connection.execute(insert.values(row))
  File "/usr/local/lib/python2.7/dist-packages/sqlalchemy/engine/base.py", line 729, in execute
    return meth(self, multiparams, params)
  File "/usr/local/lib/python2.7/dist-packages/sqlalchemy/sql/elements.py", line 321, in _execute_on_connection
    return connection._execute_clauseelement(self, multiparams, params)
  File "/usr/local/lib/python2.7/dist-packages/sqlalchemy/engine/base.py", line 826, in _execute_clauseelement
    compiled_sql, distilled_params
  File "/usr/local/lib/python2.7/dist-packages/sqlalchemy/engine/base.py", line 893, in _execute_context
    None, None)
  File "/usr/local/lib/python2.7/dist-packages/sqlalchemy/engine/base.py", line 1160, in _handle_dbapi_exception
    exc_info
  File "/usr/local/lib/python2.7/dist-packages/sqlalchemy/util/compat.py", line 199, in raise_from_cause
    reraise(type(exception), exception, tb=exc_tb)
  File "/usr/local/lib/python2.7/dist-packages/sqlalchemy/engine/base.py", line 889, in _execute_context
    context = constructor(dialect, self, conn, *args)
  File "/usr/local/lib/python2.7/dist-packages/sqlalchemy/engine/default.py", line 573, in _init_compiled
    param.append(processors[key](compiled_params[key]))
  File "/usr/local/lib/python2.7/dist-packages/sqlalchemy/processors.py", line 56, in boolean_to_int
    return int(value)
sqlalchemy.exc.StatementError: invalid literal for int() with base 10: 'n/a' (original cause: ValueError: invalid literal for int() with base 10: 'n/a') u'INSERT OR REPLACE INTO magic (published_recent, inspection_rating2, schooltype, "LA", "URL", "URN", schoolname, open_closed, opendate_full, inspection_rating, opendate_short, phase, publication_date, include, notes, inspection_date) VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?)' []

when attempting to save this line of data:

{u'published_recent': 'n/a', u'inspection_rating2': 'n/a', u'schooltype': u'Free school', u'LA': u'Tower Hamlets', u'URL': u'http://www.ofsted.gov.uk/inspection-reports/find-inspection-report/provider/ELS/138262', u'URN': u'138262', u'schoolname': u'City Gateway 14-19 Provision', u'open_closed': u'Open', u'opendate_full': u'2012-09-03', u'inspection_rating': 'No section 5 inspection yet', u'opendate_short': u'September 2012', u'phase': u'Alternative provision', u'publication_date': 'n/a', u'include': False, u'notes': 'test message', u'inspection_date': 'n/a'}

using this line of code:

scraperwiki.sql.save(["URN"], school, "magic")

(Which in Scraperwiki, saves the data in the 'school' dictionary to a database called 'magic' using the key 'URN' as the unique key.)

Weirdly, sometimes the scraper runs fine and I don't get the error, but at other times, running identical code, I do.

Things I've tried:

  1. Clearing the database I'm saving to, or starting a new database with a different name. Neither worked.
  2. Editing the data being saved. The error refers to a problem with the 'n/a' value saved against the key 'published_recent'. Previous lines of data, which save with no problem, contain Boolean values against that key, so I figured the string is causing difficulties for some reason. Changing the value to an integer means I don't get this error. Right now I can't replicate this (saving seems to work when the value is an integer), but I think I've also received this error when I've tried changing the 'published_recent' value to an integer for the line of data that seems to be causing problems: sqlalchemy.exc.IntegrityError: (IntegrityError) constraint failed

Either way, this isn't really a solution, as I need to be able to save a string.

  3. Reading all Stack Overflow questions about these two errors, and the SQLAlchemy documentation. I couldn't find anything that seemed to address the issue I've got.
  4. Using an autoincrementing key for the data. I'm saving the data on key 'URN', which is unique, but I figured that the scraper might for some reason be using the 'published_recent' key as the unique key when saving, so I tried using an autoincrementing key, following this answer: ScraperWiki: How to create and add records with autoincrement key. I still get the same error.

Thanks in advance for any answers - this is driving me a bit nuts.

philipnye

1 Answer


The error says the value it is trying to save as an integer is 'n/a'. When you're scraping data, you don't always get what you want: it seems 'n/a' is what the site you are scraping puts in that field when there's no number. You will have to do some validation on your data before saving it.
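As a minimal sketch of that validation (the `clean_row` helper, the field names, and the `'n/a'` sentinel are assumptions drawn from the question, not part of the scraperwiki API):

```python
# Sketch: normalise scraped values before handing them to
# scraperwiki.sql.save(). The field names and the 'n/a' placeholder
# are taken from the question; adapt them to your own data.

# Fields that should hold booleans (or NULL) in the database.
BOOLEAN_FIELDS = ('published_recent', 'include')

def clean_row(row):
    """Return a copy of row with 'n/a' placeholders in boolean
    fields replaced by None, so the database stores NULL instead
    of a string that SQLAlchemy will try to coerce to an int."""
    cleaned = {}
    for key, value in row.items():
        if key in BOOLEAN_FIELDS and value == 'n/a':
            cleaned[key] = None
        else:
            cleaned[key] = value
    return cleaned

school = {'URN': '138262', 'published_recent': 'n/a', 'include': False}
school = clean_row(school)
# then: scraperwiki.sql.save(["URN"], school, "magic")
```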

thule
  • `'n/a'` is actually a value I want to save against the `'published_recent'` key in cases where the scraped page doesn't meet certain criteria. I've not had any difficulty saving other strings using the same approach - but does SQLAlchemy not allow more than one type of data to be saved against a single key in a database (e.g. Booleans and strings, as I'm trying to do here)? I've not been able to find this explicitly stated in the documentation, but e.g. this suggests it is the case: http://stackoverflow.com/questions/3167842/one-field-with-different-data-types-sqlalchemy – philipnye Mar 12 '15 at 10:53
  • Well, no, normally it doesn't, and neither do SQL servers (except SQLite). What you can do is save a null value, which is probably what you want. You will still need to translate `'n/a'` to `None` manually. Regarding the link you mentioned: SQLAlchemy allows you to do a variety of clever tricks, but you really shouldn't be doing clever tricks before you get your basics on lockdown. The best advice for learning how SQLAlchemy works is to first learn how SQL works. – thule Mar 12 '15 at 13:38
  • Thanks for your answer @letitbee; really appreciated. I'm still a bit puzzled that my scraper sometimes ran fine and sometimes threw up that error, but I've edited my code so that all values saved against the `'published_recent'` key are strings, rather than a mix of data types - that seems to have stopped the error occurring. Accepting this answer. – philipnye Mar 13 '15 at 18:47
  • Glad I could help. Regarding your confusion, this is probably the site you're scraping giving you different data. You have to understand that when you are working with an external source of data, it WILL be inconsistent - in fact, it will break any and all assumptions you may have, and you should be ready for this to happen. This is why you need validation. – thule Mar 13 '15 at 19:00
  • Thanks, definitely good advice to follow. In this instance, it was when scraping exactly the same data that my scraper would sometimes work and sometimes give an error, which I think was the main thing I found puzzling - but I will be taking your advice on board :) – philipnye Mar 13 '15 at 19:15
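To illustrate the point in the comments about SQLite being the exception, here is a self-contained sketch using the stdlib `sqlite3` module (the table name and values are invented for illustration, not the scraper's real schema). Raw SQLite happily stores mixed types in one column; the error in the question comes from SQLAlchemy's Boolean type processor calling `int()` on the value before it ever reaches SQLite:

```python
import sqlite3

# SQLite uses dynamic typing: a column declared BOOLEAN will still
# accept a string, so raw SQLite never complains about 'n/a'.
conn = sqlite3.connect(':memory:')
conn.execute('CREATE TABLE magic (URN TEXT, published_recent BOOLEAN)')
conn.execute('INSERT INTO magic VALUES (?, ?)', ('100001', True))
conn.execute('INSERT INTO magic VALUES (?, ?)', ('138262', 'n/a'))  # no error

# Both values sit side by side in the same column: an integer 1
# (the adapted boolean) and the string 'n/a'.
rows = conn.execute(
    'SELECT published_recent FROM magic ORDER BY URN').fetchall()
print(rows)
```

This flexibility is exactly why the column happily accepted booleans on earlier rows: SQLite itself imposes no type check, so the failure only appears at the SQLAlchemy layer, and only for values its Boolean processor cannot coerce.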