Why would scraperwiki omit lines from scraped html?

Question

I have a really simple Python script on ScraperWiki:

import scraperwiki
import lxml.html

html = scraperwiki.scrape("http://www.westphillytools.org/toolsListing.php")
print html

I haven't written anything to parse it yet... for now I just want the html.

When I run it in edit mode it works perfectly.

When a scheduled scrape runs (or I manually run it), it omits dozens (or even hundreds) of lines.

It's a very small webpage so data overload shouldn't be a problem. Any ideas?

Are you sure it's not an artefact of how printing is handled on scraperwiki? — Marcin, Mar 07 '12 at 14:39
not sure... I get a line in the middle of my html output that reads like this - the actual numbers vary each time (brackets included): [53 lines, 159000 characters omitted] — maneesha, Mar 07 '12 at 14:43
interesting! did you have a need for the output in some way, or are you just curious as to how ScraperWiki works and when it truncates it? — frabcus, Mar 07 '12 at 16:16

score 0 · Answer 1 · answered Mar 07 '12 at 14:45

It sounds like the data are there in your variable. Try printing it a line at a time.
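A minimal sketch of what printing a line at a time could look like. The `html` string here is a hypothetical stand-in for the value returned by `scraperwiki.scrape(...)`, and `numbered_lines` is a helper name made up for illustration. Numbering the lines makes it easier to see whether anything is missing from the variable itself, as opposed to being dropped from the console display.

```python
# Hypothetical stand-in for the string returned by scraperwiki.scrape(...).
html = "<html>\n<body>\n<p>tool list</p>\n</body>\n</html>"


def numbered_lines(text):
    """Return each line of text prefixed with its 1-based line number."""
    return ["%d: %s" % (i, line) for i, line in enumerate(text.splitlines(), 1)]


# Print one line per statement instead of one large blob.
for line in numbered_lines(html):
    print(line)
```

If the count of numbered lines matches the page source, the data are intact in the variable and any gaps come from how the output is displayed or stored.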

Marcin


score 0 · Accepted Answer · answered Mar 07 '12 at 16:14

In the editor, individual print statements are rolled up into one line for display. You can click "more..." in the console on the editor to view the whole lot.

In a scheduled run, the output is written exactly as it would be in any console. So if there are line breaks in the HTML, you'll get lots of lines of output.

To reduce the amount of output we store, we truncate large outputs from scheduled runs. That's where you've seen "[53 lines, 159000 characters omitted]".

Stdout from scheduled runs is only really intended for debugging. You need to save to the datastore for any output you want to use.
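As a rough sketch of the idea, the example below stores the full page in a SQLite table instead of printing it. It uses Python's standard `sqlite3` module with an in-memory database as a stand-in for the ScraperWiki datastore, so it runs anywhere. The table name `pages`, the helpers `save_page` and `load_page`, and the sample `html` string are all assumptions made for illustration. On ScraperWiki itself the equivalent step would presumably go through `scraperwiki.sqlite.save`.

```python
import sqlite3


def save_page(conn, url, html):
    """Store the full HTML for a URL, replacing any earlier copy."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS pages (url TEXT PRIMARY KEY, html TEXT)"
    )
    conn.execute(
        "INSERT OR REPLACE INTO pages (url, html) VALUES (?, ?)", (url, html)
    )
    conn.commit()


def load_page(conn, url):
    """Return the stored HTML for a URL, or None if nothing was saved."""
    row = conn.execute(
        "SELECT html FROM pages WHERE url = ?", (url,)
    ).fetchone()
    return row[0] if row else None


# In-memory database as a stand-in for the scraper's datastore (assumption).
conn = sqlite3.connect(":memory:")
url = "http://www.westphillytools.org/toolsListing.php"

# Hypothetical sample page; in the real scraper this would be the scraped HTML.
html = "<html>\n" + "<p>row</p>\n" * 500 + "</html>"

save_page(conn, url, html)

# Log a short summary to stdout rather than the whole page.
print("saved %d characters for %s" % (len(load_page(conn, url)), url))
```

Printing only a short summary keeps the debugging output small enough to avoid truncation, while the complete HTML stays available in the database.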

frabcus


thanks... I didn't know that you couldn't just store the entire html. – maneesha Mar 08 '12 at 13:47
Not sure what you mean by store... the stored stdout from a scheduled run is meant to just be for debugging. You can store other stuff in the SQLite database... – frabcus Mar 09 '12 at 15:03
