
I am trying to scrape a small chunk of information from a site, but it keeps printing "None", as if the title (or any other tag I substitute for it) doesn't exist.

The project: build a list of metadata for WordPress plugins. About 50 plugins are of particular interest, but the challenge is that I want to fetch the metadata of all existing plugins and then filter for those with the newest timestamp, i.e. the ones that were updated most recently. It is all about how current the plugins are.

https://wordpress.org/plugins/wp-job-manager
https://wordpress.org/plugins/ninja-forms
https://wordpress.org/plugins/participants-database
...and so on.
 


We have the following set of metadata for each WordPress plugin:

Version: 1.9.5.12
Active installations: 10,000+
WordPress Version: 5.0 or higher
Tested up to: 5.4
PHP Version: 5.6 or higher
Tags: database member sign-up form volunteer
Last updated: 19 hours ago

The project consists of two parts: the looping part, which seems pretty straightforward, and the parser part, where I have some issues (see below). I'm trying to loop through an array of URLs and scrape the metadata listed above for a list of WordPress plugins. See my loop below:

   

from bs4 import BeautifulSoup

import requests

#array of URLs to loop through, will be larger once I get the loop working correctly

plugins = ['https://wordpress.org/plugins/wp-job-manager', 'https://wordpress.org/plugins/ninja-forms']

 

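This can be done like so:

# Fetch and parse one plugin page (here: the first URL in `plugins`)
page_soup = BeautifulSoup(requests.get(plugins[0]).content, 'html.parser')
ttt = page_soup.find("div", {"class":"plugin-meta"})
text_nodes = [node.text.strip() for node in ttt.ul.findChildren('li')[:-1:2]]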

 

The output of text_nodes:

 

['Version: 1.9.5.12', 'Active installations: 10,000+', 'Tested up to: 5.6 ']  
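
For later sorting and filtering it may help to turn these "Key: value" strings into a dictionary. The following is only a minimal sketch: `meta_to_dict` is a hypothetical helper name (not part of BeautifulSoup), and the sample list is copied from the output above.

```python
def meta_to_dict(text_nodes):
    """Turn 'Key: value' strings into a dict; items without a ':' are skipped."""
    meta = {}
    for node in text_nodes:
        # Split on the first colon only, so values containing ':' stay intact.
        key, sep, value = node.partition(":")
        if sep:
            meta[key.strip()] = value.strip()
    return meta


# Sample data taken from the text_nodes output shown above.
sample = ['Version: 1.9.5.12', 'Active installations: 10,000+', 'Tested up to: 5.6 ']
meta = meta_to_dict(sample)
print(meta["Version"])
```

With a dictionary per plugin, individual fields such as `meta["Version"]` can be looked up by name instead of by list position.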

But what if we want to fetch the data of all WordPress plugins and subsequently sort them to show, let us say, the 50 most recently updated plugins? This would be an interesting task:

 

  • First of all, we need to fetch the URLs.

  • Then we fetch the information and sort by timestamp to find the newest, i.e. the plugins that were updated most recently.

  • List the 50 newest items, i.e. the 50 most recently updated plugins.

Challenge: how do we avoid overloading the RAM while fetching all URLs? (See "How extract all URLs in a website using BeautifulSoup" for interesting insights, approaches and ideas.)
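
One possible way to keep memory bounded, sketched here under assumptions: feed the records through a generator and let `heapq.nlargest` keep only the 50 best candidates at any time. `iter_plugins` is a hypothetical stand-in for the real scraper and yields fake data; the record layout (`name`, `updated`) is also an assumption.

```python
import heapq
from datetime import datetime, timedelta


def iter_plugins(n):
    """Hypothetical stand-in for a scraper that yields one record at a time."""
    base = datetime(2020, 4, 9)
    for i in range(n):
        # Fake data: plugin i was updated (i * 7) % 1000 hours before `base`.
        yield {"name": f"plugin-{i}",
               "updated": base - timedelta(hours=(i * 7) % 1000)}


def newest(records, k=50):
    """Return the k most recently updated records, newest first."""
    # nlargest consumes the iterable lazily and holds only k items at a time.
    return heapq.nlargest(k, records, key=lambda r: r["updated"])


top = newest(iter_plugins(10_000), k=50)
print(len(top), top[0]["name"])
```

Because the generator is consumed one record at a time, the full list of plugins never has to exist in memory; only the current top 50 do.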

At the moment I am trying to figure out how to fetch all the URLs and parse them:

a. how to fetch the metadata of each plugin;
b. how to sort by the most recent updates;
c. afterwards, how to pick out the 50 newest.
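
For steps b and c, the "Last updated" text has to become something sortable. A rough sketch, assuming the page always uses wording like "19 hours ago" or "5 months ago" (the only forms seen in this post); `age_of` is a hypothetical helper, months and years are approximated, and the rows are made-up sample data.

```python
import re
from datetime import timedelta

# Approximate length of each unit in hours (months and years are rough).
_UNIT_HOURS = {"minute": 1 / 60, "hour": 1, "day": 24,
               "week": 24 * 7, "month": 24 * 30, "year": 24 * 365}


def age_of(last_updated):
    """Parse e.g. 'Last updated: 19 hours ago' into a timedelta."""
    m = re.search(r"(\d+)\s+(minute|hour|day|week|month|year)s?\s+ago",
                  last_updated)
    if m is None:
        # Unparseable text sorts last instead of raising an error.
        return timedelta.max
    return timedelta(hours=int(m.group(1)) * _UNIT_HOURS[m.group(2)])


# Made-up rows in the shape [title, 'Last updated: ...'].
rows = [
    ["plugin-a", "Last updated: 5 months ago"],
    ["plugin-b", "Last updated: 19 hours ago"],
    ["plugin-c", "Last updated: 2 weeks ago"],
]
by_recency = sorted(rows, key=lambda row: age_of(row[1]))
names = [row[0] for row in by_recency[:50]]
print(names)
```

The smallest age comes first, so slicing the sorted list with `[:50]` gives the 50 most recently updated entries.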
zero

1 Answer

import requests
from bs4 import BeautifulSoup
from concurrent.futures import ThreadPoolExecutor

url = "https://wordpress.org/plugins/browse/popular/{}"


def main(url, num):
    """Collect the plugin links found on one listing page."""
    with requests.Session() as req:
        print(f"Collecting Page# {num}")
        r = req.get(url.format(num))
        soup = BeautifulSoup(r.content, 'html.parser')
        # Plugin links are selected via their rel="bookmark" attribute
        link = [item.get("href")
                for item in soup.findAll("a", rel="bookmark")]
        return set(link)


with ThreadPoolExecutor(max_workers=20) as executor:
    futures = [executor.submit(main, url, num)
               for num in [""]+[f"page/{x}/" for x in range(2, 50)]]

allin = []
for future in futures:
    allin.extend(future.result())


def parser(url):
    """Return [title, meta items...] for a single plugin page."""
    with requests.Session() as req:
        print(f"Extracting {url}")
        r = req.get(url)
        soup = BeautifulSoup(r.content, 'html.parser')
        # First 8 <li> items of the list that follows the screen-reader heading
        target = [item.get_text(strip=True, separator=" ") for item in soup.find(
            "h3", class_="screen-reader-text").find_next("ul").findAll("li")[:8]]
        head = [soup.find("h1", class_="plugin-title").text]
        # Keep only the items whose label starts with one of these prefixes
        new = [x for x in target if x.startswith(
            ("V", "Las", "Ac", "W", "T", "P"))]
        return head + new


with ThreadPoolExecutor(max_workers=50) as executor1:
    futures1 = [executor1.submit(parser, url) for url in allin]

for future in futures1:
    print(future.result())
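
The prefix filter in `parser` relies on `str.startswith` accepting a tuple of prefixes. A small offline illustration follows; the `target` list is hypothetical sample data shaped like the list items above, and the "Languages" entry is an assumed example of an item the filter would drop.

```python
# Hypothetical list items, shaped like the `target` list built in parser().
target = [
    "Version: 1.9.5.12",
    "Last updated: 19 hours ago",
    "Active installations: 10,000+",
    "WordPress Version: 5.0 or higher",
    "Tested up to: 5.4",
    "PHP Version: 5.6 or higher",
    "Languages: See all 12",  # assumed extra item, not from the original post
    "Tags: database member sign-up form volunteer",
]

# startswith() returns True if the string begins with ANY prefix in the tuple.
prefixes = ("V", "Las", "Ac", "W", "T", "P")
new = [x for x in target if x.startswith(prefixes)]
print(new)
```

Note that "Las" matches "Last updated" but not "Languages", while the single letter "T" matches both "Tested up to" and "Tags".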

Output: view-online

  • This is a great solution, really very good. By the way, I would add the title of the plugin with h1.plugin-title; that is a minor change I can do myself. The harder task: what if we try to fetch all the URLs under https://wordpress.org/plugins/ ? There is an interesting discussion at https://stackoverflow.com/questions/59347372/how-extract-all-urls-in-a-website-using-beautifulsoup where you (!!) added fruitful thoughts. We do not want to blow up the RAM; then we fetch the info, sort to find the plugins that were updated most recently, and output the 50 most recent items. Any ideas? Many thanks – zero Apr 09 '20 at 10:06
  • @zero there are multiple sections of plugins. Which one? `block`, `beta` and `featured` – αԋɱҽԃ αмєяιcαη Apr 09 '20 at 10:11
  • Hi, many thanks for the quick reply. I suggest we take the popular plugins at https://wordpress.org/plugins/browse/popular/ with 99 pages of content, cf. https://wordpress.org/plugins/browse/popular/page/1/ https://wordpress.org/plugins/browse/popular/page/2/ https://wordpress.org/plugins/browse/popular/page/99/ Many thanks in advance ;) – zero Apr 09 '20 at 12:27
  • To add the title, I tried adding the h1 tag that holds it: ` def main(url, posts): for post in posts: with requests.Session() as req: r = req.get(url.format(post)) soup = BeautifulSoup(r.content, 'html.parser') target = [item.get_text(strip=True, separator=" ") for item in soup.find( "h1", class_="plugin-title"). "h3", class_="screen-reader-text").find_next("ul").findAll("li")[:8]] new = [x for x in target if x.startswith( ("V", "Las", "Ac", "W", "T", "P"))] print(new)` – zero Apr 09 '20 at 12:40
  • @zero code updated. For your next question, please keep it short instead of writing a long one: just list the points and the desired output. – αԋɱҽԃ αмєяιcαη Apr 09 '20 at 15:08
  • @zero complete output is [here](https://paste.centos.org/view/raw/4893f7ae) – αԋɱҽԃ αмєяιcαη Apr 09 '20 at 15:18
  • many thanks - this is more than expected - you are a true hero!!! Have a great day and a great weekend ;) – zero Apr 09 '20 at 16:12
  • Hi dear αԋɱҽԃ αмєяιcαη, the parser is great, I really love it. I want to run this in a PHP-based webpage; I guess I need to port it over to PHP unless I use `escapeshellcmd` and `shell_exec`. The Python script starts with #!/usr/bin/env python. The aim: run it with a small output of the newest updates, e.g. `['plugin1', 'Version: 2.34.1', 'Last updated: 5 months ago', 'Tags: magna, sed diam voluptua. At vero eos et accusam'] ['plugin2', 'Version: 6.54.1', 'Last updated: 5 months ago', 'Tags: lorem ipsum amet']`, in a phpBB block (like RSS script output). Can you help? Thanks a lot. – zero May 04 '20 at 16:07