How to write a REGEX to extract information from HTML

Question

<div data-feature-name="title">
    <h1 id="title">
        Give and Take: A Revolutionary Approach to Success
    </h1>

    <span class="author" style="font-size: 13px; line-height: 17.328125px;">
    Adam M. Grant Ph.D.
    </span>
</div>

<div data-feature-name="averageCustomerReviews">
    <a href="/Give-Take-Revolutionary-Approach-Success/product-reviews/0670026557/ref=dp_top_cm_cr_acr_txt?showViewpoints=1" >
        183 customer reviews
    </a>
</div>

<div>
    <ul>
        <li>
            <span>
                <span>Kindle</span>
                <span>
                    <span>$11.99</span>
                </span>
            </span >
        </li>

        <li>
            <span>
                <span>Hardcover</span>
                <span>
                    <span>$16.50</span>
                </span>
            </span>
        </li>

        <li>
            <span>
                <span>Paperback</span>
                <span>
                    <span>$12.65</span>
                </span>
            </span>
        </li>

        <li>
            <span>
                <span>Audible</span>
                <span>
                    <span>
                        $23.95
                    </span>
                </span>
            </span>
        </li>
    </ul>
</div>

How can I write regular expressions to extract the following: the title of the book, the author of the book, and the number of reviews of the book?

Also, how can I write the RoadRunner algorithm for this?

It would be easier to get the values with an HTML reader, although it is possible with regex. Please show what you've tried; otherwise this sounds like an assignment that you are asking us to answer. – abc123 Dec 11 '13 at 14:47
Blahblah information = s/\(.*\)/\1/g. I know you can extract data using this, but how can you extract from the example above, since it is nested? – user3055539 Dec 11 '13 at 15:27

score 0 · Answer 1 · answered Dec 11 '13 at 14:48


You should not use regex to extract data from HTML. Use a library for traversing XML/HTML.
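As a minimal sketch of the library route, assuming Python and only its standard-library html.parser module: the BookParser class, its fields dictionary, and the trimmed SAMPLE markup are hypothetical names made up for this illustration, not part of the question.

```python
from html.parser import HTMLParser

# Trimmed copy of the title block from the question (an assumption for the demo).
SAMPLE = """
<div data-feature-name="title">
    <h1 id="title">
        Give and Take: A Revolutionary Approach to Success
    </h1>
    <span class="author">
    Adam M. Grant Ph.D.
    </span>
</div>
"""


class BookParser(HTMLParser):
    """Collects the text of <h1 id="title"> and <span class="author">."""

    def __init__(self):
        super().__init__()
        self._field = None   # name of the field currently being read, if any
        self.fields = {}     # extracted values, keyed by field name

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "h1" and attrs.get("id") == "title":
            self._field = "title"
        elif tag == "span" and attrs.get("class") == "author":
            self._field = "author"

    def handle_endtag(self, tag):
        # Any closing tag ends the current field; fine for this flat markup,
        # but it would cut a field short if it contained nested tags.
        self._field = None

    def handle_data(self, data):
        text = data.strip()
        if self._field and text:
            self.fields[self._field] = text


parser = BookParser()
parser.feed(SAMPLE)
parser.close()
print(parser.fields)
```

The parser tracks which element it is inside rather than matching text, so attribute order and extra whitespace in the tags do not matter.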

answered Dec 11 '13 at 14:48

Marcin Szymczak


I can do the extraction without using regex but the problem set requires the freakin' regex expression. Help, please? – user3055539 Dec 11 '13 at 14:53
http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454 – Marcin Szymczak Dec 11 '13 at 14:55

score 0 · Answer 2 · answered Dec 11 '13 at 14:57


You cannot parse HTML in general with regular expressions.

However, if you only need to handle particular sites (for example, if you are writing a site-specific crawler), you can try these:

For title:

/id="title"[^>]*>([^<]*?)<\/h1>/

For author:

/class="author"[^>]*>([^<]*)</

For review number:

/(\d+)\s*customer review/

Many inputs will break these patterns, of course. If you need to cope with those cases, you really need a parser.
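As a rough check, the three patterns can be run with Python's re module against a trimmed copy of the question's markup. This is only a sketch: the /.../ delimiters are dropped because Python does not use them, the first_group helper is a hypothetical name, and PRICE_RE is not one of the patterns above but an extra, assumed pattern for the format/price pairs in the list.

```python
import re

# Trimmed copy of the question's markup (only the parts the patterns touch).
HTML = """
<div data-feature-name="title">
    <h1 id="title">
        Give and Take: A Revolutionary Approach to Success
    </h1>
    <span class="author" style="font-size: 13px; line-height: 17.328125px;">
    Adam M. Grant Ph.D.
    </span>
</div>
<div data-feature-name="averageCustomerReviews">
    <a href="/Give-Take-Revolutionary-Approach-Success/product-reviews/0670026557/ref=dp_top_cm_cr_acr_txt?showViewpoints=1" >
        183 customer reviews
    </a>
</div>
<ul>
    <li><span><span>Kindle</span><span><span>$11.99</span></span></span></li>
    <li><span><span>Audible</span><span><span>
        $23.95
    </span></span></span></li>
</ul>
"""

# The three patterns from the answer, without the /.../ delimiters.
TITLE_RE = re.compile(r'id="title"[^>]*>([^<]*?)</h1>')
AUTHOR_RE = re.compile(r'class="author"[^>]*>([^<]*)<')
REVIEWS_RE = re.compile(r'(\d+)\s*customer review')

# Assumption, not from the answer: a format name in one <span>, followed by
# a price inside two nested <span> elements.
PRICE_RE = re.compile(
    r'<span>([A-Za-z]+)</span>\s*<span>\s*<span>\s*(\$[\d.]+)\s*</span>'
)


def first_group(pattern, text):
    """Return the first capture group, stripped, or None if nothing matches."""
    match = pattern.search(text)
    return match.group(1).strip() if match else None


record = {
    "title": first_group(TITLE_RE, HTML),
    "author": first_group(AUTHOR_RE, HTML),
    "review_count": first_group(REVIEWS_RE, HTML),
    "prices": PRICE_RE.findall(HTML),
}
print(record)
```

The captured title and author include the surrounding line breaks and indentation, which is why the helper strips them. All four patterns depend on this exact page layout, so a change in tag nesting or attribute quoting will silently produce None or an empty list.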

answered Dec 11 '13 at 14:57

Herrington Darkholme


And I assume you use PCRE – Herrington Darkholme Dec 11 '13 at 14:57
How can I write a wrapper for the above HTML page to extract all the useful data fields, including #TITLE, #AUTHOR, #REVIEW_COUNT, #FORMAT, #PRICE? – user3055539 Dec 11 '13 at 15:26
What language are you using? – Herrington Darkholme Dec 11 '13 at 15:41
Nothing in particular, I just need the regex and wrapper. – user3055539 Dec 11 '13 at 15:42
lol, how do you expect a wrapper without specifying a language? – AeroX Dec 11 '13 at 16:14
