I am looking to create a data.frame in R from a table found at http://netflixcanadavsusa.blogspot.ca/2013/11/alphabetical-list-k-4-am-fri-nov-22-2013.html#more
It consists of three columns. The first two columns may or may not contain a flag image; the third is text. An extract is:
<span class="listings">
<table>
<tr>
<td><img class="flag" src="http://bit.ly/Y9CbVZ" /></td>
<td></td>
<td><b><a target="_blank" href="http://movies.netflix.com/WiMovie/70187567">1000 Ways to Die - Season 3</a> (2010)</b> <i style="font-size:small"> 3.6 stars, 1 Season <a target="_blank" href="http://www.imdb.com/search/title?title=1000 Ways to Die - Season 3">imdb</a></i>
</td>
</tr>
<tr>
<td><img class="flag" src="http://bit.ly/Y9CbVZ" /></td>
<td><img class="flag" src="http://bit.ly/WXvnLp" /></td>
<td><b><a target="_blank" href="http://movies.netflix.com/WiMovie/100_Below_Zero/70273426?trkid=1889703">100 Below Zero</a> (2013)</b> <i style="font-size:small"> 2.8 stars, 1hr 28m <a target="_blank" href="http://www.imdb.com/search/title?title=100 Below Zero">imdb</a></i></td>
</tr>
</table>
</span>
So here the first row has an image in the first column only, while the second row has one in both. I can extract the text and the image URLs, but cannot match them up in a way that accounts for the missing cells. Here is what I have done to date; myURL refers to the site above, and I show only the results for the extract:
library(XML)
myURL <- "http://netflixcanadavsusa.blogspot.ca/2013/11/alphabetical-list-k-4-am-fri-nov-22-2013.html#more"
basicInfo <- htmlParse(myURL, isURL = TRUE)
### text
df <- readHTMLTable(myURL,header=c("flag1","flag2","movie"), stringsAsFactors = FALSE)[[1]]
head(df,2)
# V1 V2 V3
# 1 1000 Ways to Die - Season 3 (2010) 3.6 stars, 1 Season imdb
# 2 100 Below Zero (2013) 2.8 stars, 1hr 28m imdb
### images
xpathSApply(basicInfo, "//*/span[@class='listings']/table/tr/td/img/@src")
# src src src
#"http://bit.ly/Y9CbVZ" "http://bit.ly/Y9CbVZ" "http://bit.ly/WXvnLp"
So I have the images but do not know which row/column they apply to. In this problem, each column can contain only one specific image, so it is sufficient to know whether it occurs. A more general case might have different srcs by row.
TIA
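For reference, here is a minimal sketch of the alignment I am after, written against the extract rather than the live page. It walks the table one row at a time and looks up the img src per cell, so an empty cell becomes NA instead of vanishing from the result. The inline html string, the helper name cell_src and the trimmed-down third column are assumptions for illustration only; the live page may need a different XPath.

```r
library(XML)

# Stand-in for the page: same row/cell layout as the extract above,
# with the third column shortened to the title (assumption for brevity).
html <- '<span class="listings"><table>
<tr><td><img class="flag" src="http://bit.ly/Y9CbVZ" /></td><td></td><td><b>1000 Ways to Die - Season 3</b></td></tr>
<tr><td><img class="flag" src="http://bit.ly/Y9CbVZ" /></td><td><img class="flag" src="http://bit.ly/WXvnLp" /></td><td><b>100 Below Zero</b></td></tr>
</table></span>'

doc  <- htmlParse(html, asText = TRUE)
rows <- getNodeSet(doc, "//span[@class='listings']//tr")

# Hypothetical helper: src of the image in the i-th cell of one row,
# or NA when that cell holds no image.
cell_src <- function(row, i) {
  src <- xpathSApply(row, sprintf("./td[%d]/img/@src", i))
  if (length(src) == 0) NA_character_ else unname(src[[1]])
}

# Text of the third cell of one row, with surrounding whitespace removed.
cell_text <- function(row) {
  trimws(xmlValue(getNodeSet(row, "./td[3]")[[1]]))
}

df <- data.frame(
  flag1 = vapply(rows, cell_src, character(1), i = 1),
  flag2 = vapply(rows, cell_src, character(1), i = 2),
  movie = vapply(rows, cell_text, character(1)),
  stringsAsFactors = FALSE
)
```

With this layout, flag2 is NA for the first row and holds the second image URL for the second row. Since each column can hold only one image here, !is.na(df$flag1) and !is.na(df$flag2) would give plain presence indicators, while keeping the src itself covers the more general case where it differs by row.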