
I've written a script in Python to parse profile names from different websites. Each link points to an individual whose profile information is available on that page. At the moment I'm only interested in scraping their profile names. I've provided three links to three different people in my script, and the script below works fine: I've used CSS selectors to scrape the profile names from the three sites. Since the number of links is small, it's manageable. However, there could be hundreds of them.

Now, my question is: since each site's source code is very different from the others, how can I get all the profile names out of those sites with a single script, other than what I did here by hardcoding a separate selector for each known site? What if there are hundreds of links?

Here is what I've written to get the profile names (it works fine as-is):

import requests 
from bs4 import BeautifulSoup

links = {
"https://www.paulweiss.com/professionals/associates/robert-j-agar",
"http://www.cadwalader.com/index.php?/professionals/matthew-lefkowitz",
"https://www.kirkland.com/sitecontent.cfm?contentID=220&itemID=12061"
}
for link in links:
    res = requests.get(link)
    soup = BeautifulSoup(res.text, "lxml")
    # one container selector and one name selector per site, OR-ed together with commas
    for item in soup.select("#leftnav,.article,.main-content-container"):
        pro_name = item.select(".page-hdr h1,b.hidepf,.bioBreadcrumb span")[0].text
        print(pro_name)

Output:

Robert J Agar
Matthew Lefkowitz 
Mark Adler
SIM
  • Use a separate file containing the links? – Mad Physicist Dec 28 '17 at 21:44
  • Create a description of the things you are looking for in each site in a separate config file? – Mad Physicist Dec 28 '17 at 21:45
  • There are lots of good ways of doing this, but I suspect most of them will involve storing the per-site config in some file or database or another. – Mad Physicist Dec 28 '17 at 21:45
  • Thanks @Mad Physicist for your suggestions. Is this a bad question to ask, too easy to answer, or unclear? If it is, I'll definitely take it down, because I've already noticed that someone has voted to close. – SIM Dec 28 '17 at 21:52
  • The question is about general design, not really programming. It's too broad because there are too many unrelated good answers for it. – Mad Physicist Dec 29 '17 at 03:54
  • Any pointers on what that general design might look like, or any link where something similar has been discussed? Thanks again. – SIM Dec 29 '17 at 16:26

2 Answers


Generally speaking, it would be quite difficult (if possible at all) to reliably cover every arbitrary location a profile name might occupy on an arbitrary site. The main problem is that you cannot predict what the HTML layout of a target site will be.

One alternative way to approach the problem would be to switch from HTML parsing to Natural Language Processing, and Named Entity Recognition in particular.

There are a few tools to choose from - the StanfordNERTagger from nltk, spaCy, etc.

Here is a sample using nltk (this answer should help to set things up):

import nltk
import requests
from bs4 import BeautifulSoup

from nltk.tag.stanford import StanfordNERTagger


# paths to the pre-trained 3-class English model and the Stanford NER jar
st = StanfordNERTagger('stanford-ner/english.all.3class.distsim.crf.ser.gz',
                       'stanford-ner/stanford-ner.jar')


links = {
    "https://www.paulweiss.com/professionals/associates/robert-j-agar",
    "http://www.cadwalader.com/index.php?/professionals/matthew-lefkowitz",
    "https://www.kirkland.com/sitecontent.cfm?contentID=220&itemID=12061"
}
for link in links:
    res = requests.get(link)
    soup = BeautifulSoup(res.text, "lxml")
    # work with the visible text of the whole page body
    text = soup.body.get_text()

    # tag every sentence and keep only the tokens labelled PERSON
    for sent in nltk.sent_tokenize(text):
        tokens = nltk.tokenize.word_tokenize(sent)
        tags = st.tag(tokens)
        for tag in tags:
            if tag[1] == 'PERSON':
                print(tag)
    print("----------")

Now, this would extract the person names but would also have a lot of noise:

('Mark', 'PERSON')
('Adler', 'PERSON')
('Ellis', 'PERSON')
('Mark', 'PERSON')
('Adler', 'PERSON')
('J.D.', 'PERSON')
('Mark', 'PERSON')
('Adler', 'PERSON')
('Kirkland', 'PERSON')
('Mark', 'PERSON')
----------
('PAUL', 'PERSON')
('Agar', 'PERSON')
('Robert', 'PERSON')
...
('Paul', 'PERSON')
('Weiss', 'PERSON')
('Rifkind', 'PERSON')
----------
('Bono', 'PERSON')
('Bono', 'PERSON')
('ProjectsPro', 'PERSON')
...
('Jason', 'PERSON')
('Schwartz', 'PERSON')
('Jodi', 'PERSON')
('Avergun', 'PERSON')
('Top', 'PERSON')
----------

One of the reasons for this noise is that we are parsing the text of the body of a webpage which, of course, contains lots of irrelevant information.
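
One rough way to cope with that noise - a heuristic sketch of mine rather than part of the approach above, and assuming the profile name is the PERSON span that shows up most often on its own page - is to collect consecutive PERSON tokens into spans and pick the most frequent one:

from collections import Counter

def most_common_person(tagged_tokens):
    """Group consecutive PERSON tokens into spans and return the most frequent span."""
    spans, current = [], []
    for token, label in tagged_tokens:
        if label == 'PERSON':
            current.append(token)
        elif current:
            spans.append(" ".join(current))
            current = []
    if current:
        spans.append(" ".join(current))
    return Counter(spans).most_common(1)[0][0] if spans else None

Collecting all the (token, tag) pairs from a page into one list and passing it to this helper should usually favour the repeated profile name over one-off hits like ('Top', 'PERSON').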

The overall Named Entity Recognition problem is an interesting one, and there are a lot of other techniques you could layer on top, such as using word2vec to further analyse the extracted entities. There are also deep-learning-based NER approaches worth exploring.
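
For comparison, here is a minimal sketch of the same extraction using spaCy, mentioned above as an alternative to nltk (it assumes the en_core_web_sm model has been downloaded; treat it as a starting point, not the author's code):

import requests
import spacy
from bs4 import BeautifulSoup

# requires: python -m spacy download en_core_web_sm
nlp = spacy.load("en_core_web_sm")

def person_entities(url):
    """Return the PERSON entities spaCy finds in the visible text of a page."""
    soup = BeautifulSoup(requests.get(url).text, "lxml")
    doc = nlp(soup.body.get_text(" ", strip=True))
    return [ent.text for ent in doc.ents if ent.label_ == "PERSON"]

print(person_entities("https://www.paulweiss.com/professionals/associates/robert-j-agar"))

The same caveat applies: whatever is in the page body goes through the model, so expect similar noise unless you narrow down the text first.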

alecxe
  • @Topto ah, hope it would put you on the right track - though, you are entering a rather complicated nlp area - be prepared for challenges :) As far as your error goes - check that you have `stanford-ner` directory in the same directory as the script and you have `english.all.3class.distsim.crf.ser.gz` and `stanford-ner/stanford-ner.jar` files there. Thanks. – alecxe Jan 01 '18 at 22:42
  • @Topto glad to hear it worked! Ah, every programmer needs to go through the difficulties of `npm` installations in his life :) I remember I've used `nvm` once, couple times used `brew` on Mac and sometimes installed directly - here is the [relevant documentation page](https://docs.npmjs.com/getting-started/installing-node) - though, I am sure you've seen it already..thanks! – alecxe Jan 02 '18 at 20:45

You are asking how to scale up to scraping lots of sites with diverse HTML layouts.

You already have three (link, pro_name) tuples, a disjunction over three relevant CSS selectors, and the trivial r'.*' regex accessor to extract pro_name from text. Identifying the relevant selectors and regexes is the scaling problem. You want to move away from hardcoded selectors and put them in some sort of datastore.
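
As a minimal sketch of that datastore idea - the selectors below are just the ones from your question, and the selectors.json file name is a hypothetical choice of mine - the per-site rules could be keyed by domain:

import json
from urllib.parse import urlparse

import requests
from bs4 import BeautifulSoup

# hypothetical selectors.json:
# {"www.paulweiss.com": ".page-hdr h1",
#  "www.cadwalader.com": "b.hidepf",
#  "www.kirkland.com": ".bioBreadcrumb span"}
with open("selectors.json") as f:
    SELECTORS = json.load(f)

def profile_name(link):
    """Look up the selector stored for this link's domain and extract the profile name."""
    selector = SELECTORS[urlparse(link).netloc]
    soup = BeautifulSoup(requests.get(link).text, "lxml")
    node = soup.select_one(selector)
    return node.get_text(strip=True) if node else None

Adding a new site then means adding one config entry rather than touching the code.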

So the code you have is great for N=3. Here's the code you want to write to tackle arbitrary N: "given an HTML document containing pro_name, what combination of selector + accessor will reliably extract that pro_name?". To validate such output, you will want to test against one or more further links from the same site with additional known pro_names. For that matter, you'll want to verify that repeatedly visiting the same link produces the same result, since some websites change document details on every browser refresh.

Let selector_list be the list of CSS selectors bs4 would use to navigate the DOM down from the root to the leaf node. In your question you essentially post selector_list[-1], the final entry of three such lists.
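
A rough sketch of how such a selector_list could be recovered at training time, given a page and a pro_name you already know (the helper below is illustrative, not something this answer prescribes):

from bs4 import BeautifulSoup

def selector_list_for(html, known_name):
    """Build a root-to-leaf list of CSS steps for the node whose text equals the known name."""
    soup = BeautifulSoup(html, "lxml")
    node = soup.find(string=lambda s: s and s.strip() == known_name)
    if node is None:
        return None
    path = []
    for parent in node.parents:
        if parent.name in (None, "[document]"):
            continue
        step = parent.name
        if parent.get("id"):
            step += "#" + parent["id"]
        elif parent.get("class"):
            step += "." + ".".join(parent["class"])
        path.append(step)
    path.reverse()  # root first, leaf last, matching selector_list
    return path

The last few entries of that list are what your hardcoded selectors currently encode by hand.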

At training time, start by outputting selector_list, char_offset, word_offset, and boilerplate, where the boilerplate text can be mined from multiple pages of the same site and will be incorporated in your regex. The character and word offsets are both 0 in your posted code, implying an empty site boilerplate of "". Then, for some family of access functions accepting the text plus those four parameters, output the candidate accessors (your code effectively adopts "accept r'.*' starting at character offset zero") that are observed to emit the correct pro_name. Validate each accessor against other (document, pro_name) inputs.
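
A hedged sketch of the accessor side of that (function names and the boilerplate handling are illustrative only):

import re

def candidate_accessors(boilerplate=""):
    """Yield (label, function) pairs; each function tries to pull a name out of a text node."""

    def whole_text(text):
        # what your current code effectively does: take everything, starting at offset zero
        return text.strip()

    def after_boilerplate(text):
        # illustrative: strip a boilerplate prefix mined from several pages of the same site
        match = re.search(re.escape(boilerplate) + r"\s*(.+)", text)
        return match.group(1).strip() if match else None

    yield "whole-text", whole_text
    if boilerplate:
        yield "after-boilerplate", after_boilerplate

def fit_accessor(text, known_name, boilerplate=""):
    """Return the first candidate accessor observed to emit the correct pro_name, or None."""
    for label, accessor in candidate_accessors(boilerplate):
        if accessor(text) == known_name:
            return label, accessor
    return None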

At inference time, map link to accessor, and use that to extract pro_name from HTML documents in production.
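
A minimal inference-time sketch, reusing the hypothetical names above (again, mine rather than this answer's):

from urllib.parse import urlparse
from bs4 import BeautifulSoup

# produced at training time: one validated (selector, accessor) pair per domain
RULES = {
    "www.paulweiss.com": (".page-hdr h1", lambda text: text.strip()),
    # ... one entry per site
}

def extract_pro_name(link, html):
    """Apply the stored selector and accessor for this link's site."""
    selector, accessor = RULES[urlparse(link).netloc]
    node = BeautifulSoup(html, "lxml").select_one(selector)
    return accessor(node.get_text()) if node else None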

J_H