In the function `get_links`, I fetch the links found at a URL. In the scrape function (`test` in the code below), I get the content of each URL using a `text_from_html` function (not shown in the code). I want to append each url and its visible_text to two lists, one holding the URLs and one holding the visible text of each URL. Instead, each list contains only one item and the previous item gets replaced. I want to keep the previous values as well. The output I get is:

['https://www.scrapinghub.com']
['https://www.goodreads.com/quotes']

I need them in a single list.

import re
from multiprocessing import Pool

import requests
from bs4 import BeautifulSoup

def get_links(url):
    visited_list.append(url)
    try:
        source_code = requests.get(url)
    except Exception:
        # Request failed: continue with the next URL in the fringe and return,
        # otherwise source_code would be undefined below.
        return get_links(fringe.pop(0))
    plain_text = source_code.text
    soup = BeautifulSoup(plain_text, "lxml")
    for link in soup.findAll(re.compile(r'(li|a)')):
        href = link.get('href')
        # Skip missing, already seen, or non-absolute links.
        if (href is None) or (href in visited_list) or (href in fringe) or (('http://' not in href) and ('https://' not in href)):
            continue
        # Skip links whose host already appears somewhere in the fringe.
        subs = href.split('/')[2]
        if subs in repr(fringe):
            continue
        if 'blah' in href:
            if 'www' not in href:
                # Insert "www." right after the scheme.
                href = href.split(":")[0] + ':' + "//" + "www." + href.split(":")[1][2:]
            fringe.append(href)

    return fringe

def test(url):
    try:
        res = requests.get(url)
        plain_text = res.text
        soup = BeautifulSoup(plain_text,"lxml")
        visible_text = text_from_html(plain_text)
        URL.append(url)
        paragraph.append(visible_text)
    except Exception:
        print("CHECK the URL {}".format(url))

if __name__ == "__main__":
    p = Pool(10)
    p.map(test,fringe)
    p.terminate()
    p.join()
user8128965
  • It would help if you explain your problem with respect to the code you posted. What is the expected output and what do you get instead? – swathis Feb 13 '19 at 13:45
  • Possible duplicate of [Multiprocessing of shared list](https://stackoverflow.com/questions/23623195/multiprocessing-of-shared-list) – stovfl Feb 13 '19 at 13:59
  • @swathis I edited my question can you help please – user8128965 Feb 13 '19 at 15:05
  • @user8128965 - It's still not quite clear. The output you mentioned `['https://www.scrapinghub.com'] ['https://www.goodreads.com/quotes']` corresponds to which variable? Based on the code, it could be `visited_list` or `URL`.. ? But then where are these variables defined? Are they global variables? It would be best if you could create a minimal example that we can run and replicate the issue. This way, you also get the answer fast. – swathis Feb 14 '19 at 08:22
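
A minimal sketch of one way to end up with a single list, assuming `Pool` is `multiprocessing.Pool` and that `URL` and `paragraph` are module-level lists. With a process pool, each worker appends to its own copy of those lists, so the parent process never sees the appended items, which would likely explain the one-item lists. Returning the values from the worker and letting `map` gather them avoids shared state. Here `fetch_visible_text` is a hypothetical stand-in for `requests.get` plus `text_from_html`, and `scrape` and `collect` are illustrative names, not part of the original code.

```python
from multiprocessing import Pool


def fetch_visible_text(url):
    # Hypothetical stand-in for requests.get(url) followed by text_from_html(...).
    return "visible text of " + url


def scrape(url):
    # Return the data instead of appending to a global list: appends made
    # inside a worker process are not visible in the parent process.
    try:
        return url, fetch_visible_text(url)
    except Exception:
        print("CHECK the URL {}".format(url))
        return None


def collect(results):
    # Drop failed URLs (None) and split the (url, text) pairs into two lists.
    ok = [r for r in results if r is not None]
    urls = [u for u, _ in ok]
    paragraphs = [t for _, t in ok]
    return urls, paragraphs


if __name__ == "__main__":
    fringe = ["https://www.scrapinghub.com", "https://www.goodreads.com/quotes"]
    with Pool(2) as pool:
        results = pool.map(scrape, fringe)  # one (url, text) pair per input URL
    URL, paragraph = collect(results)
    print(URL)
    print(paragraph)
```

`pool.map` returns its results in the same order as the input, so `URL` and `paragraph` stay aligned index by index. A `multiprocessing.Manager().list()` shared between workers, as in the linked duplicate, would be another option.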
