
I'm new to Python and I've just started learning Scrapy.

The code in my spider file is as follows:

from openbl.items import OpenblItem
import scrapy
import time

class OpenblSpider(scrapy.Spider):
    name='openbl'
    start_url=['http://www.openbl.org/lists/base_1days.txt']
    def parse(self, response):
        # Get the content within 'pre' and take the first element to get the content string,
        # then split the string on whitespace.
        content=response.xpath('/pre/text').extract()[0].split()
        # This loop finds the index in `content` after which
        # the remaining elements are the IPs we want.
        for i in range(0,len(content)):
            if content[i]=='ip':
                i+=1
                break
            else:
                pass
        # Construct a new list, content_data, to put the IPs in.
        content_data=[]
        # This loop puts the useful data (the IPs) into the new list above.
        for x in range(i,len(content)):
            content_data.append(content(i))

        for cont in content_data:
            item=OpenblItem()
            item['name']=cont
            item['date']=time.strftime('%Y-%m-%d %H:%M:%S',time.localtime(time.time()))
            item['type']='other'
            yield item

The website I'm crawling is http://www.openbl.org/lists/base_1days.txt. I want to extract the IPs from this page as item['name'].

I'd be grateful if someone could help me with this.


Now an error comes up when I run the spider:

 V:\work\openbl>scrapy crawl openbl -o openbl_data.json
2017-01-05 10:46:22 [scrapy.utils.log] INFO: Scrapy 1.3.0 started (bot: openbl)
2017-01-05 10:46:22 [scrapy.utils.log] INFO: Overridden settings: {'NEWSPIDER_MODULE': 'openbl.spiders', 'FEED_URI': 'openbl_data.json', 'SPIDER_MODULES': ['openbl.spiders'], 'BOT_NAME': 'openbl', 'ROBOTSTXT_OBEY': True, 'FEED_FORMAT': 'json'}
2017-01-05 10:46:22 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.feedexport.FeedExporter',
 'scrapy.extensions.logstats.LogStats',
 'scrapy.extensions.telnet.TelnetConsole',
 'scrapy.extensions.corestats.CoreStats']
2017-01-05 10:46:22 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.robotstxt.RobotsTxtMiddleware',
 'scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
 'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
 'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
 'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
 'scrapy.downloadermiddlewares.retry.RetryMiddleware',
 'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
 'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
 'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
 'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
 'scrapy.downloadermiddlewares.stats.DownloaderStats']
2017-01-05 10:46:22 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
 'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
 'scrapy.spidermiddlewares.referer.RefererMiddleware',
 'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
 'scrapy.spidermiddlewares.depth.DepthMiddleware']
2017-01-05 10:46:22 [scrapy.middleware] INFO: Enabled item pipelines:
[]
2017-01-05 10:46:22 [scrapy.core.engine] INFO: Spider opened
2017-01-05 10:46:22 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2017-01-05 10:46:22 [scrapy.extensions.telnet] DEBUG: Telnet console listening on 127.0.0.1:6023
2017-01-05 10:46:23 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://www.openbl.org/robots.txt> (referer: None)
2017-01-05 10:46:23 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://www.openbl.org/lists/base_1days.txt> (referer: None)
2017-01-05 10:46:23 [scrapy.core.scraper] ERROR: Spider error processing <GET http://www.openbl.org/lists/base_1days.txt> (referer: None)
Traceback (most recent call last):
  File "c:\python27\lib\site-packages\scrapy\utils\defer.py", line 102, in iter_errback
    yield next(it)
  File "c:\python27\lib\site-packages\scrapy\spidermiddlewares\offsite.py", line 29, in process_spider_output
    for x in result:
  File "c:\python27\lib\site-packages\scrapy\spidermiddlewares\referer.py", line 22, in <genexpr>
    return (_set_referer(r) for r in result or ())
  File "c:\python27\lib\site-packages\scrapy\spidermiddlewares\urllength.py", line 37, in <genexpr>
    return (r for r in result or () if _filter(r))
  File "c:\python27\lib\site-packages\scrapy\spidermiddlewares\depth.py", line 58, in <genexpr>
    return (r for r in result or () if _filter(r))
  File "V:\work\openbl\openbl\spiders\openbl_spider.py", line 11, in parse
    content=response.xpath('/pre/text').extract()[0].split()
IndexError: list index out of range
2017-01-05 10:46:23 [scrapy.core.engine] INFO: Closing spider (finished)
2017-01-05 10:46:23 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 454,
 'downloader/request_count': 2,
 'downloader/request_method_count/GET': 2,
 'downloader/response_bytes': 4907,
 'downloader/response_count': 2,
 'downloader/response_status_count/200': 2,
 'finish_reason': 'finished',
 'finish_time': datetime.datetime(2017, 1, 5, 2, 46, 23, 515000),
 'log_count/DEBUG': 3,
 'log_count/ERROR': 1,
 'log_count/INFO': 7,
 'response_received_count': 2,
 'scheduler/dequeued': 1,
 'scheduler/dequeued/memory': 1,
 'scheduler/enqueued': 1,
 'scheduler/enqueued/memory': 1,
 'spider_exceptions/IndexError': 1,
 'start_time': datetime.datetime(2017, 1, 5, 2, 46, 22, 383000)}
2017-01-05 10:46:23 [scrapy.core.engine] INFO: Spider closed (finished)
  • the class variable should be `start_urls` not `start_url` – eLRuLL Jan 05 '17 at 02:43
  • It looks like you are reading a text file. You can just iterate through each line in `response.body` and ignore the first 4 lines, which are commented. –  Jan 05 '17 at 02:50
  • You could simply do `python -c 'import requests; ips = [ip for ip in requests.get("http://www.openbl.org/lists/base_1days.txt").content.split("\n") if ip and not ip.startswith("#")]; print ips'` (StackOverflow inserts zero-width spaces in the comments so you'll most likely get a `SyntaxError` if you copy/paste this code. Re-write it instead.) – jDo Jan 05 '17 at 03:06

1 Answer


There are several issues with this code. First of all:

The class attribute should be `start_urls`, not `start_url`. Otherwise I don't think it's going to crawl anything.
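
That is, the top of the spider should look something like this (taken from the question's own code, with only the attribute name changed):

class OpenblSpider(scrapy.Spider):
    name = 'openbl'
    start_urls = ['http://www.openbl.org/lists/base_1days.txt']  # note the plural: start_urls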

The reason you are getting that error is that the page is a plain text file: `response.body` is literally plain text, and there is no `<pre>` tag for your XPath to match, so `extract()` returns an empty list and indexing it with `[0]` raises the IndexError. You can simply handle the body as plain text and extract the information with a regular expression, by splitting on `\n`s, etc. There are many ways to do this.
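
For example, here is a rough sketch of the whole spider along those lines, splitting the body on newlines and skipping the commented header lines as suggested in the comments above. It is untested against the live site and assumes `OpenblItem` has the `name`/`date`/`type` fields from the question:

import time

import scrapy

from openbl.items import OpenblItem


class OpenblSpider(scrapy.Spider):
    name = 'openbl'
    start_urls = ['http://www.openbl.org/lists/base_1days.txt']

    def parse(self, response):
        # The response is plain text, not HTML, so work on the body directly.
        for line in response.text.splitlines():
            line = line.strip()
            # Skip blank lines and the '#' comment lines at the top of the file.
            if not line or line.startswith('#'):
                continue
            item = OpenblItem()
            item['name'] = line  # each remaining line is a single IP
            item['date'] = time.strftime('%Y-%m-%d %H:%M:%S', time.localtime())
            item['type'] = 'other'
            yield item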

Also, don't reassign a loop variable (`i`, in your case) like that; it just doesn't read well. If you want to find the index of the first occurrence of something in a list, there is the `index()` method for that. Or see: What is the best way to get the first item from an iterable matching a condition?
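
For instance, a quick sketch with made-up data showing how `index()` replaces your whole first loop:

content = ['source', 'ip', '1.2.3.4', '5.6.7.8']  # hypothetical split() output

# index() returns the position of the first match (or raises ValueError if absent).
start = content.index('ip') + 1
ips = content[start:]
print(ips)  # ['1.2.3.4', '5.6.7.8']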
