I want to get the names of the companies in the middle column of this page (written in bold in blue), as well as the location indicator of the person who is registering the complaint (e.g. "India, Delhi", written in green). Basically, I want a table (or data frame) with two columns, one for company, and the other for location. Any ideas?
2
- What language do you want to use? – SIFE Apr 29 '11 at 10:12
- Preferably R. But Python or PHP is also okay. – user702432 Apr 29 '11 at 16:46
2 Answers
10
You can easily do this using the XML package in R. Here is the code:
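library(XML)  # provides htmlTreeParse(), xpathSApply() and xmlValue()
url = "http://www.consumercomplaints.in/bysubcategory/mobile-service-providers/page/1.html"
doc = htmlTreeParse(url, useInternalNodes = TRUE)  # parse into a tree that XPath can query
# Text of every link whose href contains 'profile'
profiles = xpathSApply(doc, "//a[contains(@href, 'profile')]", xmlValue)
profiles = profiles[!(1:length(profiles) %% 2)]  # keep only the even-indexed entries
# Text of every link whose href contains 'bystate' (the location)
states = xpathSApply(doc, "//a[contains(@href, 'bystate')]", xmlValue)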
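
Since Python was also mentioned as acceptable, here is a rough sketch of the same idea (select links by a keyword in their href, then pair the results into two columns) using only the standard library. The class name `LinkTextCollector` and the `SAMPLE_HTML` markup are made up for illustration; the real page's structure may differ, and unlike the R code above this sketch assumes each profile link appears only once per row.

```python
from html.parser import HTMLParser


class LinkTextCollector(HTMLParser):
    """Collect the text of <a> tags, grouped by a keyword found in their href."""

    def __init__(self, keywords):
        super().__init__()
        self.keywords = keywords
        self.found = {k: [] for k in keywords}
        self._current = None  # keyword of the <a> we are currently inside, if any
        self._buffer = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href") or ""
        for keyword in self.keywords:
            if keyword in href:
                self._current = keyword
                self._buffer = []
                break

    def handle_data(self, data):
        if self._current is not None:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._current is not None:
            text = "".join(self._buffer).strip()
            self.found[self._current].append(text)
            self._current = None


# Hypothetical sample markup, standing in for the downloaded page source.
SAMPLE_HTML = """
<table>
<tr><td><a href="/profile/101">alice</a></td>
    <td><a href="/bystate/delhi">India, Delhi</a></td></tr>
<tr><td><a href="/profile/102">bob</a></td>
    <td><a href="/bystate/maharashtra">India, Maharashtra</a></td></tr>
</table>
"""

parser = LinkTextCollector(["profile", "bystate"])
parser.feed(SAMPLE_HTML)

# Pair the two lists row by row, giving a two-column table as a list of tuples.
rows = list(zip(parser.found["profile"], parser.found["bystate"]))
for profile, location in rows:
    print(profile, "|", location)
```

Note that `zip` silently stops at the shorter list, so it is worth checking that both lists have the same length before trusting the pairing.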

Ramnath
1
This matches the titles in bold blue. The trick is to open the page's source code and look at what comes immediately before and after the text you are looking for, then use a regex.
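// $str holds the page's HTML source; preg_match_all() finds every match, not just the first
preg_match_all('/>([a-zA-Z0-9]+)<\/a><\/h4><\/td>/', $str, $matches);
// $matches[1] contains the text captured by the parentheses
foreach ($matches[1] as $title)
    echo $title, "\n";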
You may check this.
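
For comparison, the same regex idea can be sketched in Python with `re.findall`. The `html` string below is a hypothetical fragment written to fit the pattern, not the actual page source. The character class `[a-zA-Z0-9]+` is kept from the pattern above, so titles containing spaces or punctuation would not match without widening it.

```python
import re

# Hypothetical fragment standing in for the page source.
html = (
    '<td><h4><a href="/c1.html">Airtel</a></h4></td>'
    '<td><h4><a href="/c2.html">Vodafone</a></h4></td>'
)

# The parentheses capture only the link text; the rest anchors the match
# to the markup that follows each title.
pattern = re.compile(r">([a-zA-Z0-9]+)</a></h4></td>")

# With a single capture group, findall returns the captured text of every match.
titles = pattern.findall(html)
for title in titles:
    print(title)
```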