
I'm quite new to crawling. I crawled a webpage and extracted hyperlinks, which I then fed to Apache Nutch 1.18. All the URLs were rejected as malformed. What I'm trying to do is crawl a projects database page, extract its hyperlinks, and then crawl each page separately.

I crawled the database page using Scrapy and saved the result as a JSON file. Then I parsed the JSON file to extract the links, and fed these links to Nutch for a deep crawl of each page.

I have tried to validate these links, and the check reports that they are all wrong:

from urllib.parse import urlparse

def url_check(url):
    min_attr = ('scheme', 'netloc')
    try:
        result = urlparse(url)
        if all(getattr(result, attr) for attr in min_attr):
            print('correct')
        else:
            print('wrong')
    except ValueError:  # urlparse raises ValueError on badly malformed input
        print('wrong')
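For illustration, running `urlparse` over a few scheme-less URLs like the ones in my Nutch log shows why the check fails: without a protocol prefix, both `scheme` and `netloc` come back empty.

```python
from urllib.parse import urlparse

for url in ["traivefinance.com", "www.ceibal.edu.uy",
            "https://portaltelemedicina.com.br/en/telediagnostic-platform"]:
    result = urlparse(url)
    # A bare hostname is parsed as a path, not a netloc
    print(url, "->", repr(result.scheme), repr(result.netloc))
```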

My goal now is to fix these links so that Nutch will accept them.

This is the code I used to extract the links from the JSON file:

import codecs
import simplejson

if __name__ == '__main__':
    print('starting link extraction')
    fname = "aifos.json"
    with codecs.open(fname, "rb", encoding='utf-8') as f:
        links_data = f.read()
    json_data = simplejson.loads(links_data)

    all_links = []
    for item in json_data:
        # 'link' comes from Scrapy's extract(), so it is a list of URLs
        all_links.extend(item['link'])
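For completeness, the collected links then get written one per line into the seed file that Nutch's injector reads. The `urls/seed.txt` path is an assumption from my setup; adjust it to wherever your crawl's seed directory lives.

```python
import os

# Hypothetical sample; in practice this is the all_links list built above.
all_links = ["http://traivefinance.com",
             "https://portaltelemedicina.com.br/en/telediagnostic-platform"]

os.makedirs("urls", exist_ok=True)
with open("urls/seed.txt", "w", encoding="utf-8") as f:
    for link in all_links:
        f.write(link + "\n")  # Nutch expects one URL per line
```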

Can someone help? I have tried a few suggestions, but they keep failing.

Please note that I'm not trying to validate the URLs; I have already found that they are invalid. I am trying to fix them. These URLs all work: I have accessed them. I'm not sure now if there is something wrong with my original crawl code, so please see it below. The 'link' field is what I'm having problems with now.

    def parse_dir_content(self, response):
        items = AifosItem()

        #all_projects = response.css('div.node__content')
        title = response.css('span::text').extract()
        country = response.css('.details__item::text').extract()
        link = response.css('dd.details__item.details__item--long a::attr(href)').extract()
        short_description = response.css('.field.field--name-field-short-description.field--type-text-long.field--label-hidden').extract()
        long_description = response.css('.field.field--name-field-long-description.field--type-text-long.field--label-hidden').extract()
        #long_description = response.css('.node__content--main').extract()

        items['title'] = title
        items['country'] = country
        items['link'] = link
        items['short_description'] = short_description
        items['long_description'] = long_description

        yield items

Edit: The summary here is this: how do I fix malformed URLs for a crawler? These URLs work when clicked, but the crawler rejects them as malformed, and when I test them I get the error that they are not valid. Did I miss a parsing step? This is why I added my Scrapy crawl code, which was used to extract these URLs from the parent page.

  • your question is not clear, what's your question ? – parik May 21 '21 at 14:26
  • Sorry about that. I thought it was clear. I have a bunch of URLs which I extracted from a page through crawling. I now want to crawl these URLs, but the crawler has rejected all of them with an error that they are malformed. Going back to the original page and clicking these links shows that they in fact lead to the pages we are interested in crawling. My question is really about what fix is needed to enable me to successfully crawl these sites. Like, is there some sort of parse I'm not doing properly? – Phoenix May 21 '21 at 14:39
  • If you can put an example of "malformed url" and your Log – parik May 21 '21 at 14:51
  • It's a mix of urls - [traivefinance.com], [www.ceibal.edu.uy],[www.talovstudio.com], [https://portaltelemedicina.com.br/en/telediagnostic-platform],[www.notco.com] This is the error I get from Apache Nutch, and it fails to inject the urls for crawling :- `Skipping traivefinance.com:java.net.MalformedURLException: no protocol: traivefinance.com Skipping www.ceibal.edu.uy:java.net.MalformedURLException: no protocol: www.ceibal.edu.uy Skipping www.talovstudio.com:java.net.MalformedURLException: no protocol: www.talovstudio.com` – Phoenix May 21 '21 at 15:30
  • I've seen a suggestion here - https://stackoverflow.com/questions/1706493/java-net-malformedurlexception-no-protocol on how to fix it with Java. I'm looking for a Python version. – Phoenix May 21 '21 at 15:34

1 Answer


Have fixed this now. I found a way to fix the URLs here: How can I prepend the 'http://' protocol to a url when necessary?

This fixed the missing protocols for Nutch, but I also found that I needed to update my regex-urlfilter.txt in Nutch, as I had put in a regex expression that made the injector reject non-matching URLs. A bit embarrassing, that.
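A minimal Python sketch of that fix, since Nutch's error was `no protocol` and prepending a default scheme was enough (the function name and the `http` default are my own choices):

```python
def ensure_scheme(url, default="http"):
    """Prepend a protocol if the URL has none, so Nutch's injector accepts it."""
    if not url.startswith(("http://", "https://")):
        return f"{default}://{url}"
    return url

print(ensure_scheme("traivefinance.com"))  # http://traivefinance.com
```

URLs that already carry a scheme pass through unchanged, so it is safe to map over the whole seed list.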
