So, I'm trying to make a text-based web browser in C, and for this I'm using sockets to make HTTP requests. I've managed to retrieve .html
files from the servers I want. This is an example of one such response:
HTTP/1.1 200 OK
Accept-Ranges: bytes
Cache-Control: max-age=604800
Content-Type: text/html
Date: Wed, 27 May 2015 03:57:40 GMT
Etag: "359670651"
Expires: Wed, 03 Jun 2015 03:57:40 GMT
Last-Modified: Fri, 09 Aug 2013 23:54:35 GMT
Server: ECS (ftw/FBE4)
X-Cache: HIT
x-ec-custom-error: 1
Content-Length: 1270
<!doctype html>
<html>
<head>
<title>Example Domain</title>
<meta charset="utf-8" />
<meta http-equiv="Content-type" content="text/html; charset=utf-8" />
<meta name="viewport" content="width=device-width, initial-scale=1" />
<style type="text/css">
body {
background-color: #f0f0f2;
margin: 0;
padding: 0;
font-family: "Open Sans", "Helvetica Neue", Helvetica, Arial, sans-serif;
}
div {
width: 600px;
margin: 5em auto;
padding: 50px;
background-color: #fff;
border-radius: 1em;
}
a:link, a:visited {
color: #38488f;
text-decoration: none;
}
@media (max-width: 700px) {
body {
background-color: #fff;
}
div {
width: auto;
margin: 0 auto;
border-radius: 0;
padding: 1em;
}
}
</style>
</head>
<body>
<div>
<h1>Example Domain</h1>
<p>This domain is established to be used for illustrative examples in documents. You may use this
domain in examples without prior coordination or asking for permission.</p>
<p><a href="http://www.iana.org/domains/example">More information...</a></p>
</div>
</body>
</html>
However, I now need to 'clean up' the whole thing, and for this I'd like to extract a few things:
- Title of the page (between the tags <title></title>).
- Titles of paragraphs (between the tags <div><h1></h1></div>).
- Text of paragraphs (between the tags <p></p>).
I've been experimenting with the Online Regex Tester, and so far I've only managed to get the page title, using the regex <title>(.*)</title>.
That works, but when I use the regex <title>(aA-zZ)*</title> I get no matches (why?).
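For reference, here is a minimal sketch of how the two title patterns can be compared from C. It assumes POSIX `<regex.h>` is available and uses extended syntax; the helper name `pattern_matches` is made up for illustration and is not part of any library.

```c
#include <regex.h>
#include <stddef.h>

/* Hypothetical helper: returns 1 if `pattern` (POSIX extended syntax)
   matches somewhere in `text`, 0 if it does not, and -1 if the pattern
   fails to compile. */
int pattern_matches(const char *pattern, const char *text)
{
    regex_t re;

    /* REG_NOSUB: only a yes/no answer is needed, no capture offsets. */
    if (regcomp(&re, pattern, REG_EXTENDED | REG_NOSUB) != 0)
        return -1;

    int rc = regexec(&re, text, 0, NULL, 0);
    regfree(&re);
    return rc == 0 ? 1 : 0;
}
```

With this helper, `pattern_matches("<title>(aA-zZ)*</title>", "<title>Example Domain</title>")` returns 0, because `(aA-zZ)` is a group that matches the literal characters `aA-zZ` rather than a range of letters. A bracket expression such as `<title>[A-Za-z ]*</title>` returns 1 for the same input.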
Then I tried to get everything between the <body></body> tags, so that I can process it later and extract the three items mentioned above, but when I use <body>(.*)</body>
I get no matches (why?).
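One possible factor here is how `.` treats newlines, since the body spans several lines. Many regex engines do not let `.` match a newline by default, whereas POSIX `regcomp()` does unless `REG_NEWLINE` is set. The sketch below assumes POSIX `<regex.h>`; `body_matches` is a hypothetical helper written only to show the difference, and which behaviour the online tester uses is not something I know.

```c
#include <regex.h>
#include <stddef.h>

/* Hypothetical helper: returns 1 if <body>(.*)</body> matches `html`,
   0 if it does not, and -1 if the pattern fails to compile.
   `extra_flags` is OR-ed into the regcomp() flags (e.g. REG_NEWLINE). */
int body_matches(const char *html, int extra_flags)
{
    regex_t re;

    if (regcomp(&re, "<body>(.*)</body>",
                REG_EXTENDED | REG_NOSUB | extra_flags) != 0)
        return -1;

    int rc = regexec(&re, html, 0, NULL, 0);
    regfree(&re);
    return rc == 0 ? 1 : 0;
}
```

For a multi-line body, `body_matches(html, 0)` returns 1 because `.` is allowed to cross line breaks, while `body_matches(html, REG_NEWLINE)` returns 0 because `.` then stops at each newline.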
Hope you guys can help me. Thanks!
EDIT REGARDING DUPLICATE QUESTION:
My question is not a duplicate. I'm looking for help with a regular expression that captures the text between <div> tags. I know this isn't the right way to do it, and I know other libraries exist, but I must do it this way.
I'm not trying to build a tree of all the tags; my problem is very specific: I just need the text between some tags. My problem is finding the right regular expression.
For the title tags <title></title> I have the regex: <title>([A-Z a-z]*)</title>
For the paragraph tags <p></p> I have the regex: [\\r\\n\\t]*<p>([a-zA-Z. \\r\\n ]+)</p>[\\r\\n\\t]*
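To show how a captured group can be pulled out in C, here is a minimal sketch assuming POSIX `<regex.h>` with extended syntax. The helper `extract_group` and its parameters are hypothetical names chosen for illustration, and it only returns the first match in the text.

```c
#include <regex.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: copies the first capture group of `pattern`
   (POSIX extended syntax) found in `text` into `out`, truncating if the
   buffer is too small. Returns 0 on success, -1 on compile failure,
   no match, or a zero-size buffer. */
int extract_group(const char *pattern, const char *text,
                  char *out, size_t out_size)
{
    regex_t re;
    regmatch_t m[2];   /* m[0] = whole match, m[1] = first group */

    if (out_size == 0 || regcomp(&re, pattern, REG_EXTENDED) != 0)
        return -1;

    int rc = regexec(&re, text, 2, m, 0);
    regfree(&re);
    if (rc != 0 || m[1].rm_so < 0)
        return -1;

    size_t len = (size_t)(m[1].rm_eo - m[1].rm_so);
    if (len >= out_size)
        len = out_size - 1;
    memcpy(out, text + m[1].rm_so, len);
    out[len] = '\0';
    return 0;
}
```

Calling `extract_group("<title>([A-Z a-z]*)</title>", html, buf, sizeof buf)` on the page above leaves `Example Domain` in `buf`. One caveat for the paragraph pattern: inside a POSIX bracket expression a backslash is an ordinary character, so in a C string literal the class would need to be written with single backslashes (`"[\r\n\t]*"`), letting the compiler insert the actual control characters.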
Now I only need help with the <div> tags. Thanks!