I have a list of very specific URLs that I need to scrape data from, each with different selectors/fields. There are around 1,000 links in total, spread across roughly 300 different websites with different structures (CSS selector/XPath). I am trying to see if anyone has a suggestion on how this can be done. I searched the web for solutions and saw people recommending Python and Scrapy. I don't know much about either and am still trying to understand them, but from what I found it seems that if I use Scrapy/Python for this, I will have to create a separate spider for each link (at least for the ones with different structures). I also looked at Scrapy's generic spider classes and tried to use them for my case, but they didn't work.
Sample links and fields are shown below: "url" is the page, and the fields identified by "selector" are the values I want to extract from that page. I want each extracted value in the output keyed by its field's "name".
"urls": [
  {
    "url": "https://www.australianclinicaltrials.gov.au/resources-clinical-trials-australia",
    "fields": [
      {
        "name": "Body",
        "selector": "#block-system-main .even"
      },
      {
        "name": "Page Updated",
        "selector": "time"
      }
    ]
  },
  {
    "url": "https://www.canada.ca/en/health-canada/corporate/about-health-canada/branches-agencies/health-products-food-branch/biologics-genetic-therapies-directorate.html",
    "fields": [
      {
        "name": "Body",
        "selector": "main h1#wb-cont+div"
      },
      {
        "name": "Page Updated",
        "selector": "#wb-dtmd time"
      }
    ]
  }
]
Lastly, I have better knowledge of PHP, so any suggestions on using PHP for this purpose are also appreciated.