I'm trying to extract a specific piece of data from a webpage. I'm using Python and urllib to fetch the page, but the data I need is surrounded by a lot of irrelevant markup. I figured the best way to extract it is to use a regular expression.
I'm looking for the name "Huisman, D.J." in the following string. It is already an excerpt of the full page text:
\n \n \n</div>\n <div class="col-sm-8 col-md-6" id="id12">\n
<div>\n \n <div class="col-xs-11">\n
<div>Huisman, D.J.</div>\n</div>\n \n </div>\n
</div>\n \n </div>\n</div>\n </div><div
id="id13">\n <div id="id14">\n \n <div class="row">\n
<div class="col-sm-2 col-md-2">\n
I tried the following two expressions. The first tries to select everything between <div> and </div>. The expression is:
r'<div>+(.*?)</div>'
But it fails because there is already another <div> before the one I want to select from. So I get:
['\\n \\n <div class="col-xs-11">\\n <div>Huisman, D.J.']
So I thought I could instead match from the first capital letter up to </div>, but the result only starts after that first capital letter. Code and result:
#expression:
r'[A-Z]+(.*?)</div>'
#result
['uisman, D.J.']
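For reference, here is a minimal, self-contained sketch that reproduces both attempts. The variable names `text`, `first`, and `second` and the shortened sample string are assumptions for illustration, not my actual script; the `\n` sequences are treated as literal backslash-n characters, as they appear in the results above.

```python
import re

# Shortened sample of the page text (an assumption for illustration).
# Raw strings keep each "\n" as two literal characters, backslash and n,
# matching how the text shows up in the results above.
text = (
    r'\n</div>\n <div class="col-sm-8 col-md-6" id="id12">\n '
    r'<div>\n \n <div class="col-xs-11">\n '
    r'<div>Huisman, D.J.</div>\n</div>\n \n </div>\n'
)

# First attempt: capture everything between "<div>" and the next "</div>".
first = re.findall(r'<div>+(.*?)</div>', text)

# Second attempt: match a capital letter, then capture up to "</div>".
second = re.findall(r'[A-Z]+(.*?)</div>', text)

print(first)
print(second)
```

Both lists contain a single element: the first includes the markup that precedes the name, and the second is missing the leading "H".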
Can somebody help me write an expression that returns the full name?