I started a small crawler that reads a local HTML file, without using BeautifulSoup from bs4. The issues I am facing are:
- After the script runs to completion, the results contain duplicated elements, and I don't know why.
- When I compare the two lists (extracted links and test links) with a string comparison, every result is false.
You can find the full files here.
The HTML code (links part):
<h2> test #2 <a href="file:///home/godzilla/crawler-project/test-2.html">first website</a></h2>
<h2> test #3 <a href="file:///home/godzilla/crawler-project/test-3.html">a second website refer to index file</a></h2>
<h2> test #4 <a href="file:///home/godzilla/crawler-project/test-4.html"> a website contain many links , atree one of them refer to index file and another duplicated link </a></h2>
Python script:
import string
import urllib

links = []

def ex_link(y):
    result = []
    for i in y:
        if i != None and i.find("<h2>") != -1:
            result.append(i)
    for element in result:
        for i in range(len(result)):
            raw = result[result.index(element)].replace(" ","")
            first = raw.find("<ahref")
            end = raw.find(".html")
            links.append(raw[first+7:end+6])
    print links
    return links

def page(x):
    page = urllib.urlopen(x)
    result = []
    while True:
        page_line = page.readline()
        result.append(page_line)
        if not page_line : break
    return result

link_test = page("file:///home/godzilla/crawler-project/test-1.html")
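A likely source of the duplicates, offered as a guess rather than a confirmed diagnosis: the inner `for i in range(len(result))` loop appears to process the same `element` again on every pass, so each link is appended more than once. The sketch below visits each line once and skips links it has already seen. It assumes every link line contains `<h2>` and a double-quoted `href`; `extract_links`, `HREF_RE` and `sample` are hypothetical names, and the standard `re` module is swapped in for the manual slicing. It avoids the `print` statement so it should run on both Python 2 and Python 3.

```python
import re

# Hypothetical pattern: captures the URL between the quotes of an href.
HREF_RE = re.compile(r'<a\s+href="([^"]*)"')

def extract_links(lines):
    """Return each href found on an <h2> line once, in first-seen order."""
    links = []
    for line in lines:
        # Skip empty lines and lines that are not <h2> link lines.
        if not line or "<h2>" not in line:
            continue
        match = HREF_RE.search(line)
        if match is None:
            continue
        url = match.group(1)   # captured without the surrounding quotes
        if url not in links:   # ignore links that were already collected
            links.append(url)
    return links

# Sample input modelled on the HTML above, with one line repeated on purpose.
sample = [
    '<h2> test #2 <a href="file:///home/godzilla/crawler-project/test-2.html">first website</a></h2>',
    '<h2> test #3 <a href="file:///home/godzilla/crawler-project/test-3.html">a second website refer to index file</a></h2>',
    '<h2> test #3 <a href="file:///home/godzilla/crawler-project/test-3.html">a second website refer to index file</a></h2>',
    '',
]
result = extract_links(sample)
```

Returning a fresh list instead of appending to the module-level `links` also keeps repeated calls from accumulating earlier results.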
The extracted links are:
"file:///home/godzilla/crawler-project/test-2.html"
"file:///home/godzilla/crawler-project/test-3.html"
"file:///home/godzilla/crawler-project/test-3.html"
"file:///home/godzilla/crawler-project/test-3.html"
"file:///home/godzilla/crawler-project/test-4.html"
"file:///home/godzilla/crawler-project/test-4.html"
"file:///home/godzilla/crawler-project/test-4.html"
and the test links are:
"file:///home/godzilla/crawler-project/test-2.html"
file:///home/godzilla/crawler-project/test-3.html
file:///home/godzilla/crawler-project/test-4.htm
http://www.google.com
file:///home/godzilla/crawler-project/test-1.html
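A possible reason the string comparison between the two lists fails, stated as an assumption because the comparison code is not shown: the slice `raw[first+7:end+6]` starts at the opening double quote and ends just after the closing one, so each extracted string carries its quotes, while most of the test links do not. Lines read from a file may also end in a newline. The sketch below normalizes both sides before comparing; `normalize`, `extracted`, `test_links`, `found` and `missing` are hypothetical names, and the data is a shortened illustration.

```python
def normalize(link):
    # Drop surrounding whitespace/newlines, then any wrapping double quotes.
    return link.strip().strip('"')

# Shortened illustration of the two lists: quoted extracted links,
# unquoted test links with trailing newlines.
extracted = [
    '"file:///home/godzilla/crawler-project/test-2.html"',
    '"file:///home/godzilla/crawler-project/test-3.html"',
]
test_links = [
    'file:///home/godzilla/crawler-project/test-3.html\n',
    'http://www.google.com\n',
]

# A set gives fast membership tests on the normalized test links.
wanted = set(normalize(t) for t in test_links)

found = [normalize(e) for e in extracted if normalize(e) in wanted]
missing = [normalize(e) for e in extracted if normalize(e) not in wanted]
```

With this data, `found` holds the `test-3.html` link and `missing` holds the `test-2.html` link.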
The output is:
"file:///home/godzilla/crawler-project/test-2.html" not found !
"file:///home/godzilla/crawler-project/test-3.html" not found !
"file:///home/godzilla/crawler-project/test-3.html" not found !
"file:///home/godzilla/crawler-project/test-3.html" not found !
"file:///home/godzilla/crawler-project/test-4.html" not found !
"file:///home/godzilla/crawler-project/test-4.html" not found !
"file:///home/godzilla/crawler-project/test-4.html" not found !