Extract Text from a Webpage

Question

Assume I want to extract customer reviews from a site like bestbuy.com or walmart.com. Suppose a fragment of the reviews page looks like this:

<div class="BVRRReviewTitleContainer"><span class="BVRRLabel BVRRReviewTitlePrefix"></span> <h2>
<span itemprop="name" class="BVRRValue BVRRReviewTitle">Perfect size for the kids and durable</span> </h2>
<span class="BVRRLabel BVRRReviewTitleSuffix">, </span></div>
<div class="BVRRReviewDateContainer"><span class="BVRRLabel BVRRReviewDatePrefix"></span><span class="BVRRValue BVRRReviewDate">11/22/2013<meta itemprop="datePublished" content="2013-11-22"/></span><span class="BVRRLabel BVRRReviewDateSuffix"></span></div>
<div class="RRBeforeUserContainerSpacer"></div>
<div class="BVRRUserNicknameContainer"><span class="BVRRLabel BVRRUserNicknamePrefix">By </span><span class="BVRRValue BVRRUserNickname"><span itemprop="author" class="BVRRNickname">wilbuh </span></span> <span class="BVRRLabel BVRRUserNicknameSuffix">,</span>
<div class="BVRRUserLocationContainer"><span class="BVRRLabel BVRRUserLocationPrefix"></span><span class="BVRRValue BVRRUserLocation">Oakland, ME</span><span class="BVRRLabel BVRRUserLocationSuffix"></span></div></div>
<div class="BVRROverallRatingContainer" >
<div class="BVRRRatingContainerStar"><div class="BVRRRatingEntry BVRROdd"><div id="BVRRRatingOverall_Review_Display" class="BVRRRating BVRRRatingNormal BVRRRatingOverall"><div class="BVRRLabel BVRRRatingNormalLabel"></div><div class="BVRRRatingNormalImage">
<div class="BVImgOrSprite" style="width:75px;height:15px;overflow:hidden"><img src="http://walmart.ugc.bazaarvoice.com/1336/5_0/9/rating.png" alt="5 out of 5" title="5 out of 5" width="135" height="15" />
</div></div>
<div itemprop="reviewRating" itemscope itemtype="http://schema.org/Rating" class="BVRRRatingNormalOutOf"> <span itemprop="ratingValue" class="BVRRNumber BVRRRatingNumber">5</span>
<span class="BVRRSeparatorText">out of</span>
<span itemprop="bestRating" class="BVRRNumber BVRRRatingRangeNumber">5</span>
</div></div></div></div> </div>
<div class="RRReviewDisplayStyle2BeforeContentContainerSpacer"></div>
<div class="BVRRReviewDisplayStyle2ContentContainer">
<div class="BVRRReviewTextContainer"><div class="BVRRReviewTextParagraph BVRRReviewTextFirstParagraph BVRRReviewTextLastParagraph"><span itemprop="description" class="BVRRReviewText">Bought this tablet for my kids after I purchased a no name brand and it did not perform well at all. I have the 10.1, and absolutely love it and so this 7&quot; was the perfect compliment to it. Its an amazing tablet, easy to use, and durable for my 5 and 7 year old kids.</span>

Is it possible to extract the review title ("Perfect size for the kids and durable") and the review description ("Bought this tablet for my kids after I purchased a no name brand and it did not perform well at all. I have the 10.1, and absolutely love it and so this 7" was the perfect compliment to it. Its an amazing tablet, easy to use, and durable for my 5 and 7 year old kids.")? I am looking to automate the process so that I can extract all review titles and descriptions.

http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags I don't know if R can parse HTML, but I suggest you not use regexes for this task — Gabber, Nov 27 '13 at 13:34
Regular expressions are great for some tasks, but they have their limits. If you want to strip out or capture something that follows a very clear, non-recursive pattern, then you can use them. However, if the page is XHTML, as in well-formed XML, then use XSLT, XQuery or XPath, as those can use the structure of the data to give you a smarter and more reliable way of getting exactly what you want. — Rob, Nov 27 '13 at 13:39
Your XML is corrupted. It would be easier to give the link to the page with the review, so the data can be extracted using the `XML` package. — agstudy, Nov 27 '13 at 14:03
Of course you can do that very well with RegExp, but that is already all there is to it. It is not an R question but a trivial RegEx question, so I am voting to close it. — Raffael, Nov 27 '13 at 14:16
Don't use regex to parse XML, or [this may happen to you](http://stackoverflow.com/a/1732454/271616). — Joshua Ulrich, Nov 27 '13 at 14:18
possible duplicate of [Regex matching everything that's not a 4 digit number](http://stackoverflow.com/questions/12115566/regex-matching-everything-thats-not-a-4-digit-number) - or pretty much any question about RegEx with R — Raffael, Nov 27 '13 at 14:18
@JoshuaUlrich: well, you are just giving a modern interpretation of "cargo cult" as you didn't understand apparently what bobince is talking about. bobince is referring to situations where fully-working XML parsing has to be relied on - like for security reasons. Of course keyword-parsing of HTML code with RegEx for web scraping is fine. Most HTML code isn't well-formed anyway so you just run into issues trying to treat it like an XML document — Raffael, Nov 27 '13 at 14:21
(@JoshuaUlrich: the post you are referring to is BTW one of my favorite SO posts ever) — Raffael, Nov 27 '13 at 14:24
@Gabber I don't see where the OP is talking about regular expressions? — agstudy, Nov 27 '13 at 14:38
Thanks to everyone for your recommendations and comments. Very much appreciated. — Marc Moroccoholic, Nov 27 '13 at 14:45

score 3 · Accepted Answer · answered Nov 27 '13 at 14:36

The question is a simple XPath exercise, but your XML fragment is malformed: it is missing some closing "div" tags. I corrected it, and you can find the new version in this gist

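library(XML)

# Parse the corrected file (the fragment from the question with the missing closing tags restored)
doc <- xmlParse(file = 'test.xml')

# Review title: select the element by its exact class attribute and return its text
xpathSApply(doc, '//*[@class="BVRRValue BVRRReviewTitle"]', xmlValue)
[1] "Perfect size for the kids and durable"

# Review description: the text of everything inside the review text container
xpathSApply(doc, '//*[@class="BVRRReviewTextContainer"]', xmlValue)
[1] "Bought this tablet for my kids after I purchased a no name brand and it 
     did not perform well at all. I have the 10.1, and absolutely 
     love it and so this 7\" was the perfect compliment to it. 
     Its an amazing tablet, easy to use, and durable for my 5 and 7 year old kids."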
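
If you would rather not repair the markup by hand, a possible alternative is htmlParse() from the same XML package, which uses a lenient HTML parser and usually copes with unclosed tags. The following is only a minimal sketch under stated assumptions: the fragment is a shortened stand-in for the real page, the second review is hypothetical and made up for illustration, and I assume every review carries the itemprop="name" and itemprop="description" attributes seen in the question, so that titles and descriptions pair up by position.

```r
library(XML)

# Shortened stand-in for the reviews page. The last <div> is left unclosed
# on purpose, to mimic the malformed fragment in the question.
fragment <- paste0(
  '<div class="BVRRReviewTitleContainer"><h2>',
  '<span itemprop="name" class="BVRRValue BVRRReviewTitle">',
  'Perfect size for the kids and durable</span></h2></div>',
  '<div class="BVRRReviewTextContainer">',
  '<span itemprop="description" class="BVRRReviewText">',
  'Its an amazing tablet, easy to use.</span></div>',
  # Hypothetical second review, invented for this example only
  '<div class="BVRRReviewTitleContainer"><h2>',
  '<span itemprop="name" class="BVRRValue BVRRReviewTitle">',
  'Good value</span></h2></div>',
  '<div class="BVRRReviewTextContainer">',
  '<span itemprop="description" class="BVRRReviewText">',
  'Works as expected.</span>'
)

# asText = TRUE tells htmlParse() that the argument is markup, not a file name
doc <- htmlParse(fragment, asText = TRUE)

# Select by itemprop rather than by the full class string
titles <- xpathSApply(doc, '//span[@itemprop="name"]', xmlValue)
texts  <- xpathSApply(doc, '//span[@itemprop="description"]', xmlValue)

# One row per review; this assumes titles and texts have the same length
reviews <- data.frame(title = titles, description = texts,
                      stringsAsFactors = FALSE)
print(reviews)
```

For a real page you would pass the saved HTML file to htmlParse() instead of a string. If some reviews lack a title or a description, the positional pairing above breaks, and it would be safer to loop over each review's container node and query inside it.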
