<div data-feature-name="title">
    <h1 id="title">
        Give and Take: A Revolutionary Approach to Success
    </h1>

    <span class="author" style="font-size: 13px; line-height: 17.328125px;">
    Adam M. Grant Ph.D.
    </span>
</div>

<div data-feature-name="averageCustomerReviews">
    <a href="/Give-Take-Revolutionary-Approach-Success/product-reviews/0670026557/ref=dp_top_cm_cr_acr_txt?showViewpoints=1" >
        183 customer reviews
    </a>
</div>

<div>
    <ul>
        <li>
            <span>
                <span>Kindle</span>
                <span>
                    <span>$11.99</span>
                </span>
            </span >
        </li>

        <li>
            <span>
                <span>Hardcover</span>
                <span>
                    <span>$16.50</span>
                </span>
            </span>
        </li>

        <li>
            <span>
                <span>Paperback</span>
                <span>
                    <span>$12.65</span>
                </span>
            </span>
        </li>

        <li>
            <span>
                <span>Audible</span>
                <span>
                    <span>
                        $23.95
                    </span>
                </span>
            </span>
        </li>
    </ul>
</div> 

How can I write regular expressions to extract the following from the HTML above: the title of the book, the author of the book, and the number of customer reviews?

Also, how can I implement the RoadRunner algorithm for this?

  • It would be easier to get the values with an HTML parser, although it is possible with regex. Please show what you've tried; otherwise this sounds like an assignment that you are asking us to answer. – abc123 Dec 11 '13 at 14:47
  • What tool or language do you use? – Casimir et Hippolyte Dec 11 '13 at 14:57
  • Given "Blahblah information", I know you can extract data with s/\(.*\)/\1/g, but how can you extract from the example above, since it is nested? – user3055539 Dec 11 '13 at 15:27

2 Answers


You should not use regex to extract data from HTML. Use a library that parses and traverses XML/HTML instead.
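As a minimal sketch of that approach, here is one way it might look in Python using only the standard library's `html.parser`. The class name `BookInfoParser`, the helper `extract_book_info`, and the trimmed `SAMPLE` markup are hypothetical names made up for this illustration; the selection rules (an `h1` with `id="title"`, a `span` with `class="author"`, and a link whose `href` contains `product-reviews`) are assumptions based on the snippet in the question.

```python
from html.parser import HTMLParser

# Trimmed copy of the markup from the question (illustrative only).
SAMPLE = """
<div data-feature-name="title">
    <h1 id="title">
        Give and Take: A Revolutionary Approach to Success
    </h1>
    <span class="author">
    Adam M. Grant Ph.D.
    </span>
</div>
<div data-feature-name="averageCustomerReviews">
    <a href="/Give-Take-Revolutionary-Approach-Success/product-reviews/0670026557/ref=x">
        183 customer reviews
    </a>
</div>
"""


class BookInfoParser(HTMLParser):
    """Collects the title, author, and review count from a product page."""

    def __init__(self):
        super().__init__()
        self._field = None  # name of the field whose text we expect next
        self.info = {}

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "h1" and attrs.get("id") == "title":
            self._field = "title"
        elif tag == "span" and attrs.get("class") == "author":
            self._field = "author"
        elif tag == "a" and "product-reviews" in (attrs.get("href") or ""):
            self._field = "reviews"

    def handle_endtag(self, tag):
        # Any closing tag ends the field we were collecting.
        self._field = None

    def handle_data(self, data):
        text = data.strip()
        if not (self._field and text):
            return
        if self._field == "reviews":
            # "183 customer reviews" -> "183"
            text = text.split()[0]
        self.info[self._field] = text


def extract_book_info(html):
    parser = BookInfoParser()
    parser.feed(html)
    parser.close()
    return parser.info


if __name__ == "__main__":
    print(extract_book_info(SAMPLE))
```

Because the parser tracks tags rather than raw characters, the nesting and the extra attributes do not need to be spelled out in a pattern.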

Marcin Szymczak

You cannot reliably parse arbitrary HTML with regular expressions.

However, if you only intend to parse particular sites (for example, if you are writing a site-specific crawler), you can try the following patterns.

For title:

/id="title"[^>]*>([^<]*?)<\/h1>/

For author:

/class="author"[^>]*>([^<]*)</

For review number:

/(\d+)\s*customer review/

Many cases will break these patterns, of course, such as a change in attribute order or extra nested tags. If you want to cope with those cases, you really need a parser.
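For illustration, the three patterns could be applied in Python roughly as follows. This is only a sketch: the function name `extract_with_regex` and the trimmed `SAMPLE` string are hypothetical, and the captured text is stripped afterwards because the title and author patterns also capture the surrounding whitespace and newlines.

```python
import re

# Trimmed copy of the markup from the question (illustrative only).
SAMPLE = """
<div data-feature-name="title">
    <h1 id="title">
        Give and Take: A Revolutionary Approach to Success
    </h1>
    <span class="author">
    Adam M. Grant Ph.D.
    </span>
</div>
<div data-feature-name="averageCustomerReviews">
    <a href="/Give-Take-Revolutionary-Approach-Success/product-reviews/0670026557/ref=x">
        183 customer reviews
    </a>
</div>
"""

# The same three patterns as above, written as Python raw strings.
TITLE_RE = re.compile(r'id="title"[^>]*>([^<]*?)</h1>')
AUTHOR_RE = re.compile(r'class="author"[^>]*>([^<]*)<')
REVIEWS_RE = re.compile(r'(\d+)\s*customer review')


def extract_with_regex(html):
    """Return title, author, and review count; a field is None if its pattern fails."""
    title = TITLE_RE.search(html)
    author = AUTHOR_RE.search(html)
    reviews = REVIEWS_RE.search(html)
    return {
        "title": title.group(1).strip() if title else None,
        "author": author.group(1).strip() if author else None,
        "reviews": int(reviews.group(1)) if reviews else None,
    }


if __name__ == "__main__":
    print(extract_with_regex(SAMPLE))
```

Returning None for a field whose pattern does not match makes it easier to notice when a page layout change has broken one of the patterns.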

Herrington Darkholme