
Here is the code:

str_regex = r'(https?:\/\/)?([a-z]+\d\.)?([a-z]+\.)?activeingredients\.[a-z]+(/?(work|about|contact)?/?([a-zA-Z-]+)*)?/?'

import urllib.request
from Stacks import Stack          # author's helper modules
import re
import functools
import operator as op
from nary_tree import *

url = 'http://www.activeingredients.com/'
s = set()                         # urls already visited
List = []                         # extracted text, one string per page

def f_go(List, s, url):
    try:
        if url in s:
            return
        s.add(url)
        with urllib.request.urlopen(url) as response:
            html = response.read()
        h = html.decode("utf-8")
        # parse the page into an n-ary tree, then flatten it back to a string
        lst0 = prepare_expression(list(h))
        ntr = buildNaryParseTree(lst0)
        lst2 = nary_tree_tolist(ntr)
        lst3 = functools.reduce(op.add, lst2, [])
        str2 = ''.join(lst3)
        List.append(str2)
        # extract candidate links with the regex
        l1 = [tok.group(0) for tok in re.finditer(str_regex, h)]
        for exp in l1:
            # skip image links (.jpg / .png), follow everything else
            if exp.endswith('jpg') or exp.endswith('png'):
                continue
            f_go(List, s, exp)
    except:
        return

Using urllib.request.urlopen, it opens URLs recursively, staying within one domain (in this case activeingredients.com); link extraction from a page is done with the regular expression above. For each opened page it parses the HTML and appends the result to a list as a string. So what this is supposed to do is walk through the given domain, extract the meaningful text from each page, and add it to a list. The try/except block simply returns on any HTTP error (and every other error too, but this is tested and working).
It works for a small site like this one, but for bigger sites it is extremely slow and eats memory.
The parsing and page preparation more or less do the right job, I believe.
The question is: is there an efficient way to do this? How do web search engines crawl the network so fast?
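
One way to avoid the deep recursion described above is an explicit queue. Below is a minimal sketch of the same crawl written iteratively with collections.deque; it reuses the regex from the question, keeps only the raw HTML (the n-ary-tree text extraction is omitted), and the name crawl is hypothetical, not part of the original code:

import re
import urllib.request
from collections import deque

STR_REGEX = r'(https?:\/\/)?([a-z]+\d\.)?([a-z]+\.)?activeingredients\.[a-z]+(/?(work|about|contact)?/?([a-zA-Z-]+)*)?/?'

def crawl(start_url):
    seen = set()
    pages = []                        # one string of page content per visited url
    queue = deque([start_url])
    while queue:
        url = queue.popleft()
        if url in seen:
            continue
        seen.add(url)
        try:
            with urllib.request.urlopen(url) as response:
                html = response.read().decode("utf-8")
        except Exception:
            continue                  # skip anything that fails to load
        pages.append(html)            # the parse-tree text extraction would go here
        for match in re.finditer(STR_REGEX, html):
            link = match.group(0)
            if not link.endswith(('jpg', 'png')):   # skip image links
                queue.append(link)
    return pages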

  • If this is **working code**, see [codereview.se]. But why are you [parsing HTML with regex](http://stackoverflow.com/a/1732454/3001761)? – jonrsharpe Jan 22 '17 at 20:27
  • Looks to me like you should definitely use multithreading, since most of the time you are waiting for network content. You should send multiple requests concurrently. – Willem Van Onsem Jan 22 '17 at 20:27
  • This is a good question, but the answer is complex. For starters, you should be using a database to store your data so that the entire dataset does not need to be loaded into memory (a database sketch follows these comments). Also, you should be loading multiple web requests in parallel. But yeah, this is kind of a complex task. Maybe try looking for an existing library that does this? – dana Jan 22 '17 at 20:27
  • @jonrsharpe: even worse: using iterators over the content and checking every character. – Willem Van Onsem Jan 22 '17 at 20:28
  • @jonrsharpe That's one hell of an answer you linked. Went straight from 0 to /r/45thworldproblems. – Tagc Jan 22 '17 at 20:33
  • @Tagc: It used to be pointed to by so many links that it was sort of banned for a while. Seems nobody likes too much of Yeats. – Jongware Jan 22 '17 at 20:39
  • @jonrsharpe: the regex is just for extracting links from the document; the parser is standard, building an n-ary syntax tree. And yeah, this code works, but it's so slow that I decided to put it here... –  Jan 22 '17 at 20:40
  • So multithreading and using a database would be a good start; I will try that. Any recommendation for an existing library that does this? –  Jan 22 '17 at 20:42
  • Library recommendations are off-topic on SO. – jonrsharpe Jan 22 '17 at 20:44
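
The database suggestion above could look roughly like the following minimal sketch using the standard-library sqlite3 module; the pages table and the save_page/already_seen helpers are hypothetical names, not part of the original code:

import sqlite3

# on-disk store for crawled pages, so the whole dataset
# never has to sit in a Python list in memory
conn = sqlite3.connect('crawl.db')
conn.execute('CREATE TABLE IF NOT EXISTS pages (url TEXT PRIMARY KEY, text TEXT)')

def save_page(url, text):
    # INSERT OR IGNORE also doubles as the "already visited" bookkeeping
    conn.execute('INSERT OR IGNORE INTO pages (url, text) VALUES (?, ?)', (url, text))
    conn.commit()

def already_seen(url):
    return conn.execute('SELECT 1 FROM pages WHERE url = ?', (url,)).fetchone() is not None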

1 Answer


First: I don't think Google's web crawler runs on one laptop or one PC, so don't worry if you can't get results like the big companies do.

Points to consider:

  1. You could start with a big list of words, which you can download from many websites; that rules out some useless combinations of URLs. After that you could crawl with plain letter combinations to get uselessly named sites into your index as well.

  2. You could start with a list of all registered domains from DNS servers, i.e. something like this: http://www.registered-domains-list.com

  3. Use multiple threads (see the sketch below)

  4. Have a lot of bandwidth

  5. Consider buying Google's Data-Center

These points are just suggestions to give you a basic idea of how you could improve your crawler.
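
As a rough illustration of point 3, fetching several pages concurrently with the standard-library concurrent.futures module might look like this; fetch and crawl_batch are hypothetical names, not part of the answer:

import urllib.request
from concurrent.futures import ThreadPoolExecutor, as_completed

def fetch(url):
    # most crawl time is spent waiting on the network, so running
    # many downloads in threads lets the waits overlap
    with urllib.request.urlopen(url, timeout=10) as response:
        return url, response.read().decode("utf-8", errors="replace")

def crawl_batch(urls, workers=8):
    pages = {}
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = [pool.submit(fetch, u) for u in urls]
        for future in as_completed(futures):
            try:
                url, html = future.result()
                pages[url] = html
            except Exception:
                pass                  # skip pages that failed to load
    return pages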

fameman
  • Thanks, that's clear. I see it's bigger than I thought. –  Jan 23 '17 at 12:03
  • Yes. And unfortunately it's not possible to keep it small. If you think the answer helps, you can give it an upvote (only if you want to). I am glad I could help you. Happy coding ;-) – fameman Jan 23 '17 at 12:38