
I have a list of very specific URLs that I need to scrape data from (different selectors/fields per page). There are around 1,000 links in total, from around 300 different websites, each with a different structure (selector/XPath). I am trying to see if anyone has suggestions on how this can be done. Searching the web, I see people recommending Python and Scrapy. I don't know much about either and am still trying to understand them, but from what I have found it looks like I would have to create a separate spider for each link (at least for each different structure). I also looked at Scrapy's generic spider classes and tried to use those for my case, but they didn't work.

The sample links and fields I want to extract are shown below: "url" is the page, the fields identified by "selector" are the things I want to extract from that page, and I want the output of each stored under the field's "name".

"urls":[
         {
            "url":"https://www.australianclinicaltrials.gov.au/resources-clinical-trials-australia",
             "fields":[
               {
                  "name":"Body",
                  "selector":"#block-system-main .even"
               },
               {
                  "name":"Page Updated",
                  "selector":"time"
               }
            ]
         },
         {
            "url":"https://www.canada.ca/en/health-canada/corporate/about-health-canada/branches-agencies/health-products-food-branch/biologics-genetic-therapies-directorate.html",
            "fields":[
               {
                  "name":"Body",
                  "selector":"main h1#wb-cont+div"
               },
               {
                  "name":"Page Updated",
                  "selector":"#wb-dtmd time"
               }
            ]
         }
      ]
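What I was hoping the generic spider approach would let me do is drive a single spider from this config instead of writing one spider per page. A rough sketch of that idea, as far as I understand Scrapy so far (names like PAGE_CONFIG and ConfigSpider are just made up for illustration, and this is the kind of thing I could not get working):

import scrapy

# Hypothetical name for the config above; in practice it could be loaded from a JSON file.
PAGE_CONFIG = [
    {
        "url": "https://www.australianclinicaltrials.gov.au/resources-clinical-trials-australia",
        "fields": [
            {"name": "Body", "selector": "#block-system-main .even"},
            {"name": "Page Updated", "selector": "time"},
        ],
    },
    # ... remaining entries in the same shape
]

class ConfigSpider(scrapy.Spider):
    """One spider for all pages; the CSS selectors come from each config entry."""
    name = "config_spider"

    def start_requests(self):
        for entry in PAGE_CONFIG:
            # Carry each page's field definitions along with its request.
            yield scrapy.Request(entry["url"], callback=self.parse_page,
                                 meta={"fields": entry["fields"]})

    def parse_page(self, response):
        item = {"url": response.url}
        for field in response.meta["fields"]:
            # Raw HTML of the first match, or None (append ::text to a selector for text only).
            item[field["name"]] = response.css(field["selector"]).get()
        yield item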

Lastly, I have better knowledge of PHP, so any suggestions on using PHP for this purpose are also appreciated.

– SorishK

1 Answer


You have to write a spider for every page (structure) that you want to scrape

  • The basic rule of scraping.

Having said that, the links you have posted look like article or news pages. If that is the case, you can check out Newspaper3k, a Python library that extracts the content from any article/news page.

It works by taking the article's metadata and processing that. Since most articles expose their information in metadata for SEO purposes, it is likely to handle almost any article on the web.

Check it out here https://github.com/codelucas/newspaper
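A minimal usage sketch, assuming the library is installed with pip install newspaper3k (the URL is just one of the pages from the question):

from newspaper import Article

url = "https://www.australianclinicaltrials.gov.au/resources-clinical-trials-australia"

article = Article(url)
article.download()   # fetch the HTML
article.parse()      # extract title, body text, authors, publish date, ...

print(article.title)
print(article.publish_date)  # may be None if the page exposes no date metadata
print(article.text)          # the extracted body text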

– Rahul
  • Thanks @rahul for the reply/suggestions. As I mentioned, I am new and don't have much knowledge of Python or of scraping data; I am just starting to learn them. What I meant by "need to write a spider for any page that needs to be scraped" is: do I need to create spider_1.py, spider_2.py, ..., spider_n.py to scrape n different pages, or can I use one generic spider and create instances of it that take page-specific parameters? I am thinking about this in a sort of object-oriented way, so please guide me if I am on the wrong path. Once again, thanks for the reply! Regards, Orish – SorishK Aug 28 '18 at 14:51
  • Hey Orish, the answer is no. As I said already, Newspaper3k eliminates the need to create multiple spiders. You can process any article without creating spiders at all: just give the article URL to the Newspaper3k library and it will extract the data from the article. There is no need to create a spider for each article. Hope this helps. – Rahul Aug 29 '18 at 06:33
  • Orish, how did it go? Your case is quite similar to mine. Newspaper3k works for me about 90% of the time; I hope it gets to 95%. I also have many websites to extract information from, but I am limited by Newspaper3k's author extractor. Because the structure differs from site to site (as you said), some authors are not picked up. – tursunWali Feb 11 '21 at 18:01