Python Scrapy on offline (local) data

Question

I have a 270 MB dataset (10,000 HTML files) on my computer. Can I use Scrapy to crawl this dataset locally? If so, how?

Kyle Kelley · Answer 1 · 2013-11-15T13:19:00.820

34

SimpleHTTP Server Hosting

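If you truly want to host it locally and use Scrapy, you can serve it by navigating to the directory it's stored in and running Python's built-in HTTP server (port 8000 shown below). On Python 2 the module is SimpleHTTPServer; on Python 3 it is http.server:

python -m SimpleHTTPServer 8000  # Python 2
python -m http.server 8000       # Python 3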
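If you would rather start the server from Python 3 code than from the shell, the standard library can do the same job. The following is a minimal sketch and not part of the original answer: the helper name `make_server` is made up for illustration, and it assumes Python 3.7 or later and binds to 127.0.0.1 only.

```python
from functools import partial
from http.server import SimpleHTTPRequestHandler, ThreadingHTTPServer


def make_server(directory, port=8000):
    """Return an HTTP server that serves the files under `directory` on 127.0.0.1."""
    # SimpleHTTPRequestHandler accepts a `directory` keyword since Python 3.7.
    handler = partial(SimpleHTTPRequestHandler, directory=directory)
    return ThreadingHTTPServer(("127.0.0.1", port), handler)


# Usage (blocks until interrupted with Ctrl+C):
#     make_server("/home/sagi/html_files").serve_forever()
```

Passing the directory explicitly means the server does not depend on the current working directory, unlike the command-line form.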

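Then point Scrapy at http://127.0.0.1:8000 by putting that URL in your spider's start_urls. Note that scrapy crawl takes the name of a spider, not a URL (<spider_name> below is a placeholder for your own spider's name):

$ scrapy crawl <spider_name>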

file://

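An alternative is to have Scrapy read the files directly through file:// URLs, which it can fetch like any other URL. Put one URL per file in your spider's start_urls; a file:// URL that names a directory gives the crawler no listing to follow. Assuming you're on a *nix system and the files live in /home/sagi/html_files, each URL has this form (<name> is a placeholder):

file:///home/sagi/html_files/<name>.html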

Wrapping up

Once you've set up your spider (see the example project dirbot), run the crawler by spider name (<spider_name> is a placeholder):

$ scrapy crawl <spider_name>

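If links in the HTML files are absolute rather than relative, though, this may not work well: following them sends the crawler to the original site instead of your local copy. You would need to adjust the files (or the extracted links) yourself.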
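One way to handle absolute links without editing the files is to rewrite each extracted URL before following it. The sketch below is an assumption-laden illustration, not the answer's method: `to_local` and `LOCAL_BASE` are made-up names, example.com stands in for whatever host the pages were saved from, and it assumes the local directory layout mirrors the original site's URL paths.

```python
from urllib.parse import urlsplit, urlunsplit

LOCAL_BASE = "http://127.0.0.1:8000"  # assumed address of the local server


def to_local(url, original_host, local_base=LOCAL_BASE):
    """Map an absolute URL on `original_host` to the same path on the local server.

    URLs on any other host, and relative URLs, are returned unchanged.
    """
    parts = urlsplit(url)
    if parts.netloc != original_host:
        return url
    local = urlsplit(local_base)
    # Keep path, query, and fragment; swap only the scheme and host.
    return urlunsplit((local.scheme, local.netloc, parts.path, parts.query, parts.fragment))


# to_local("http://example.com/docs/a.html?x=1", "example.com")
# returns "http://127.0.0.1:8000/docs/a.html?x=1"
```

In a spider, a function like this could be applied to each link before the request for it is created.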

edited Nov 15 '13 at 13:19

answered Oct 15 '13 at 16:16

Kyle Kelley


You do realize that awarding the bonus to yourself won't earn you a hat, right? :-P – Martijn Pieters Dec 19 '13 at 02:32

@MartijnPieters I'm giving out bounties on several. Happy Holidays! On some level, I was hoping the asker would accept an answer. :-/ – Kyle Kelley Dec 19 '13 at 03:30

Your answer is certainly thorough enough to deserve at least *some* feedback, indeed! – Martijn Pieters Dec 19 '13 at 11:39

For python 3.1: python -m http.server – pso May 02 '18 at 16:05

Ratan Kumar · Answer 2 · 2013-10-16T12:56:01.493

9

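Go to your dataset folder and run:

import os

# Read every HTML file in the current directory, skipping subdirectories and other files.
for name in sorted(os.listdir(os.getcwd())):
    if not name.lower().endswith((".html", ".htm")) or not os.path.isfile(name):
        continue
    with open(name, "r", encoding="utf-8", errors="replace") as f:
        page_content = f.read()
    # Do whatever you want with page_content here, e.g. parse it with lxml or Beautiful Soup.

No need to go for Scrapy!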
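As an illustration of the "do whatever you want" step, here is a sketch that pulls the title out of each file. It uses the standard library's html.parser rather than the lxml or Beautiful Soup mentioned above, purely so that it runs without extra installs; the names `TitleParser`, `extract_title`, and `titles_in` are made up for this example, and it assumes the files end in .html.

```python
from html.parser import HTMLParser
from pathlib import Path


class TitleParser(HTMLParser):
    """Collect the text inside the first <title> element."""

    def __init__(self):
        super().__init__()
        self._in_title = False
        self.title = None

    def handle_starttag(self, tag, attrs):
        if tag == "title" and self.title is None:
            self._in_title = True
            self.title = ""

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data


def extract_title(html):
    """Return the stripped title text of an HTML string, or None if it has no title."""
    parser = TitleParser()
    parser.feed(html)
    parser.close()
    return parser.title.strip() if parser.title is not None else None


def titles_in(directory):
    """Map each .html file name in `directory` to its title (or None)."""
    return {
        path.name: extract_title(path.read_text(encoding="utf-8", errors="replace"))
        for path in sorted(Path(directory).glob("*.html"))
    }
```

Calling `titles_in(".")` from the dataset folder would return a dictionary keyed by file name.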

edited Oct 16 '13 at 12:56

answered Oct 15 '13 at 17:25

Ratan Kumar


There are so many functionalities that Scrapy provides, e.g., multiple options to get the same field. Of course, you can replicate all these functionalities but that would be reinventing the wheel. – MAltakrori Sep 04 '20 at 16:51
