python2.7 crawler without BeautifulSoup

Question

I started a small crawler that reads a local HTML file, without using BeautifulSoup from bs4. The issues I am facing are:

I get duplicated results after the script runs to completion, and I don't know why the elements are duplicated.
When I compare the list of extracted links with the list of test links, every comparison returns False (string comparison).

You can find the full files here.

The HTML code (links part):

<h2> test #2 <a href="file:///home/godzilla/crawler-project/test-2.html">first website</a></h2>
<h2> test #3 <a href="file:///home/godzilla/crawler-project/test-3.html">a second website refer to index file</a></h2>
<h2> test #4 <a href="file:///home/godzilla/crawler-project/test-4.html"> a website contain many links , atree one of them refer to index file and another duplicated link </a></h2>

Python script:

import string
import urllib

links = []

def ex_link(y):
    result = []
    for i in y:
        if i != None and i.find("<h2>") != -1:
            result.append(i)
    for element in result:
        for i in range(len(result)):
            raw = result[result.index(element)].replace(" ","")
            first = raw.find("<ahref")
            end = raw.find(".html")
            links.append(raw[first+7:end+6])
    print links
    return links

def page(x):
    page = urllib.urlopen(x)
    result = []
    while True:
        page_line = page.readline()
        result.append(page_line)
        if not page_line : break
    return result

link_test =  page("file:///home/godzilla/crawler-project/test-1.html")
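
A note on the loops in ex_link: the inner `for i in range(len(result))` repeats the same extraction once per matching line, so each link is appended several times, and because `links` is a module-level list, repeated calls keep adding to it. The slice `raw[first+7:end+6]` also keeps the double quotes around the URL. The following is a minimal sketch of one way to avoid both, assuming every link sits on a single `<h2>` line in the form shown in the HTML excerpt; `extract_links` is a hypothetical name, and the third sample line is made up to show the de-duplication.

```python
def extract_links(lines):
    """Return each href found on an <h2> line once, without the quotes."""
    links = []  # local list, so repeated calls do not accumulate results
    marker = 'href="'
    for line in lines:
        if not line or "<h2>" not in line:
            continue
        start = line.find(marker)
        if start == -1:
            continue
        start += len(marker)          # first character after the opening quote
        end = line.find('"', start)   # position of the closing quote
        if end == -1:
            continue
        link = line[start:end]
        if link not in links:         # skip links that were already collected
            links.append(link)
    return links


sample = [
    '<h2> test #2 <a href="file:///home/godzilla/crawler-project/test-2.html">first website</a></h2>',
    '<h2> test #3 <a href="file:///home/godzilla/crawler-project/test-3.html">a second website refer to index file</a></h2>',
    # hypothetical repeated line, added only for illustration
    '<h2> test #3 <a href="file:///home/godzilla/crawler-project/test-3.html">repeated</a></h2>',
    '',
]
print(extract_links(sample))
```

A single pass over the lines is enough here; the nested loop and `result.index(element)` in the original are not needed to reach each line.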

The extracted links are:

"file:///home/godzilla/crawler-project/test-2.html"
"file:///home/godzilla/crawler-project/test-3.html"
"file:///home/godzilla/crawler-project/test-3.html"
"file:///home/godzilla/crawler-project/test-3.html"
"file:///home/godzilla/crawler-project/test-4.html"
"file:///home/godzilla/crawler-project/test-4.html"
"file:///home/godzilla/crawler-project/test-4.html"

and the test links to compare against are:

"file:///home/godzilla/crawler-project/test-2.html"

file:///home/godzilla/crawler-project/test-3.html

file:///home/godzilla/crawler-project/test-4.htm

http://www.google.com

file:///home/godzilla/crawler-project/test-1.html

The output is:

"file:///home/godzilla/crawler-project/test-2.html" not found !

"file:///home/godzilla/crawler-project/test-3.html" not found !

"file:///home/godzilla/crawler-project/test-3.html" not found !

"file:///home/godzilla/crawler-project/test-3.html" not found !

"file:///home/godzilla/crawler-project/test-4.html" not found !

"file:///home/godzilla/crawler-project/test-4.html" not found !

"file:///home/godzilla/crawler-project/test-4.html" not found !
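
The comparison code is not shown, so this is only a guess at why every check prints "not found !": the extracted strings carry literal double quotes (they are visible in the output above), while most of the test links do not, and a stray newline or space would have the same effect. A minimal sketch, assuming a plain membership test; `normalize` is a hypothetical helper.

```python
def normalize(link):
    """Strip surrounding whitespace and double quotes from a link string."""
    return link.strip().strip('"')


# One extracted value as it appears in the output, quotes included.
extracted = ['"file:///home/godzilla/crawler-project/test-3.html"']
test_links = [
    "file:///home/godzilla/crawler-project/test-3.html",
    "http://www.google.com",
]

raw_match = extracted[0] in test_links               # quotes make the strings differ
clean_match = normalize(extracted[0]) in test_links  # compares the bare URL
print("raw: %s, normalized: %s" % (raw_match, clean_match))
```

Printing `repr(link)` for both lists before comparing is a quick way to see such hidden characters.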

There are a lot of good reasons to use BeautifulSoup and really few not to use it; one of them is that [regex cannot parse HTML](https://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags#answer-1732454) — Arount, May 31 '18 at 08:11
@Dennis well, I am aware of that, but this is for practice and I will use it later. The error that I cannot solve is strange to me — Godzilla, May 31 '18 at 16:28
