
So, I'm trying to make a text-based web browser in C, and for this I'm using sockets to make HTTP calls. I managed to retrieve .html files from the servers I want. This is an example:

HTTP/1.1 200 OK
Accept-Ranges: bytes
Cache-Control: max-age=604800
Content-Type: text/html
Date: Wed, 27 May 2015 03:57:40 GMT
Etag: "359670651"
Expires: Wed, 03 Jun 2015 03:57:40 GMT
Last-Modified: Fri, 09 Aug 2013 23:54:35 GMT
Server: ECS (ftw/FBE4)
X-Cache: HIT
x-ec-custom-error: 1
Content-Length: 1270

<!doctype html>
<html>
<head>
    <title>Example Domain</title>

    <meta charset="utf-8" />
    <meta http-equiv="Content-type" content="text/html; charset=utf-8" />
    <meta name="viewport" content="width=device-width, initial-scale=1" />
    <style type="text/css">
    body {
        background-color: #f0f0f2;
        margin: 0;
        padding: 0;
        font-family: "Open Sans", "Helvetica Neue", Helvetica, Arial, sans-serif;

    }
    div {
        width: 600px;
        margin: 5em auto;
        padding: 50px;
        background-color: #fff;
        border-radius: 1em;
    }
    a:link, a:visited {
        color: #38488f;
        text-decoration: none;
    }
    @media (max-width: 700px) {
        body {
            background-color: #fff;
        }
        div {
            width: auto;
            margin: 0 auto;
            border-radius: 0;
            padding: 1em;
        }
    }
    </style>    
</head>

<body>
<div>
    <h1>Example Domain</h1>
    <p>This domain is established to be used for illustrative examples in documents. You may use this
    domain in examples without prior coordination or asking for permission.</p>
    <p><a href="http://www.iana.org/domains/example">More information...</a></p>
</div>
</body>
</html>

However, I now need to 'clean up' the whole thing, and for this I'd like to retrieve some things:

  1. Title of page. (Between tags <title></title>)
  2. Titles of paragraphs. (Between tags <div><h1></h1></div>)
  3. Text of paragraphs. (Between tags <p></p>)

I've been experimenting with the Online Regex Tester, and so far I've only managed to get the title of the page using the regex <title>(.*)</title>, which works. But when I use the regex:

<title>(aA-zZ)*</title> I get no matches (WHY?).

Then I tried to get everything between the <body></body> tags so that I can process it later and get the 3 points I mentioned earlier, but when I use <body>(.*)</body> I get no matches (WHY?).

Hope you guys can help me. Thanks!


EDIT REGARDING DUPLICATE QUESTION: My question is not a duplicate. I'm trying to find help with a regular expression that would catch the text between <div> tags. I do know it's not the right way to do it, and I know other libraries exist, but I must do it this way.

I'm not trying to build a tree of all the tags. My problem is very specific: I just need the text between some tags, and I need to find the right regular expression for that.

For the title tags <title></title> I have the RegEx: <title>([A-Z a-z]*)</title>.

For the paragraph tags <p></p> I have the RegEx: [\\r\\n\\t]*<p>([a-zA-Z. \\r\\n ]+)</p>[\\r\\n\\t]*.

Now I only need help with the <div> tags. Thanks!

David Merinos
  • You'd better use an HTML parser; I'd recommend libxml2, which includes one. – Iharob Al Asimi May 28 '15 at 02:02
  • @iharob Thanks for your suggestion, however, I only need to get a few tags out of the whole HTML code, I believe it's possible, I made this question so that someone would help me out with the Regular Expression. – David Merinos May 28 '15 at 02:32
  • I know, but in your case I would use `strstr()`. – Iharob Al Asimi May 28 '15 at 02:52
  • @DavidMerinos It may be *possible* (which in this case, turns out to have a pretty *loose* meaning), but that doesn't necessarily make it wise. Consider that there are *other* libraries out there that are specifically designed for parsing webpages. – autistic May 28 '15 at 03:17
  • possible duplicate of [RegEx match open tags except XHTML self-contained tags](http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags) – autistic May 28 '15 at 03:19

1 Answer

Where

<title>(aA-zZ)*</title>

is concerned, I think you are missing a couple of concepts. Parentheses () are for capture groups. (aA-zZ) matches the literal text aA-zZ. The asterisk after the group repeats the whole group, so (aA-zZ)* matches zero or more consecutive copies of the literal aA-zZ. That is why a title such as Example Domain never matches.

I think you are looking for

<title>[A-Za-z ]*</title>

Square brackets [] define a character class, which matches any single character listed inside it, including ranges. [A-Za-z ]* matches zero or more uppercase letters, lowercase letters, or spaces.
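
As a rough illustration of the character-class version, here is a minimal sketch in C. It assumes the POSIX `<regex.h>` API (`regcomp`/`regexec`) is available, which the question does not actually state, and `extract_group` is a hypothetical helper name. The pattern passed to it needs parentheses around the class so that the title text lands in capture group 1.

```c
#include <assert.h>
#include <regex.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper (not from the question): copy capture group 1 of
 * `pattern` found in `text` into `out`, truncating if `out` is too small.
 * Returns 0 on success, -1 if there is no match, if group 1 did not
 * participate in the match, or if the pattern fails to compile. */
int extract_group(const char *text, const char *pattern, int extra_cflags,
                  char *out, size_t out_size)
{
    regex_t re;
    regmatch_t m[2];   /* m[0] = whole match, m[1] = first capture group */
    int rc = -1;

    assert(text != NULL && pattern != NULL && out != NULL && out_size > 0);
    out[0] = '\0';

    /* REG_EXTENDED selects POSIX extended syntax, where ( ) group and
     * * repeats the preceding item. */
    if (regcomp(&re, pattern, REG_EXTENDED | extra_cflags) != 0)
        return -1;

    if (regexec(&re, text, 2, m, 0) == 0 && m[1].rm_so != -1) {
        size_t len = (size_t)(m[1].rm_eo - m[1].rm_so);
        if (len >= out_size)
            len = out_size - 1;   /* truncate rather than overflow */
        memcpy(out, text + m[1].rm_so, len);
        out[len] = '\0';
        rc = 0;
    }

    regfree(&re);
    return rc;
}
```

With the response from the question stored in a string `html`, calling `extract_group(html, "<title>([A-Za-z ]*)</title>", 0, buf, sizeof buf)` should return 0 and leave `Example Domain` in `buf`, while the `<title>(aA-zZ)*</title>` pattern should return -1.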

Where

<body>(.*)</body>

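is concerned, the problem is likely that your input contains line breaks. In many regex flavors, presumably including the one behind this website, the dot does not match a newline by default, so .* cannot stretch from <body> on one line to </body> on a later one. Different tools have different workarounds for this: some offer a "dot matches newline" option, and others let you replace the dot with a character class that includes newlines.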
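
Since the end goal is C, it may help to see how this plays out there. The sketch below assumes the POSIX `<regex.h>` API, which the question does not confirm, and `pattern_matches` is a hypothetical helper name. In POSIX regexes the dot does match a newline unless the pattern is compiled with `REG_NEWLINE`, so the same pattern can behave differently in a C program than in an online tester.

```c
#include <assert.h>
#include <regex.h>
#include <stddef.h>

/* Hypothetical helper (not from the question): returns 1 if `pattern`
 * matches somewhere in `text`, 0 if it does not, and -1 if the pattern
 * fails to compile. */
int pattern_matches(const char *text, const char *pattern, int extra_cflags)
{
    regex_t re;
    int rc;

    assert(text != NULL && pattern != NULL);

    /* REG_NOSUB: only a yes/no answer is needed, no capture offsets. */
    if (regcomp(&re, pattern, REG_EXTENDED | REG_NOSUB | extra_cflags) != 0)
        return -1;

    rc = (regexec(&re, text, 0, NULL, 0) == 0) ? 1 : 0;
    regfree(&re);
    return rc;
}
```

Passing `0` as `extra_cflags` lets `<body>(.*)</body>` span several lines, whereas passing `REG_NEWLINE` reproduces the "no matches" behavior on multi-line input, because the dot then stops at each newline.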

Politank-Z
  • Marking this as the answer because it was the only one that really saw the point of the question. Thank you. I managed to get some H1/2/3... tags as well. With a little bit of work I could get tags. – David Merinos May 29 '15 at 04:12