
I'm trying to scrape this table titled Battle Styles into a dataframe. https://bulbapedia.bulbagarden.net/wiki/Battle_Styles_(TCG)#Set_lists

The problem is that many of the cells contain images that carry vital information, and rvest isn't picking that information up.

The table should look like this:

No.     Card name   Type    Rarity
001/163 Bellsprout  Grass   Common
002/163 Weepinbell  Grass   Uncommon
003/163 Victreebel  Grass   Rare
004/163 Cacnea      Grass   Common
005/163 Cacturne    Grass   Uncommon
006/163 KricketuneV Grass   Ultra-Rare Rare
007/163 Cherubi     Grass   Common
008/163 Cherrim     Grass   Rare Holo
009/163 Carnivine   Grass   Uncommon
010/163 Durant      Grass   Uncommon

and this table ^^ is what I'm able to get if I copy the table and paste it into notepad.

However, my result does not contain any of the information from the images. It looks like this:

# A tibble: 184 x 6
   No.     Image `Card name` Type  Rarity Promotion
   <chr>   <lgl> <chr>       <chr> <lgl>  <chr>    
 1 001/163 NA    Bellsprout  ""    NA     Promotion
 2 002/163 NA    Weepinbell  ""    NA     Promotion
 3 003/163 NA    Victreebel  ""    NA     Promotion
 4 004/163 NA    Cacnea      ""    NA     Promotion
 5 005/163 NA    Cacturne    ""    NA     Promotion
 6 006/163 NA    Kricketune  ""    NA     Promotion
 7 007/163 NA    Cherubi     ""    NA     Promotion
 8 008/163 NA    Cherrim     ""    NA     Promotion
 9 009/163 NA    Carnivine   ""    NA     Promotion
10 010/163 NA    Durant      ""    NA     Promotion

The information I need from the images is in their alt text, so I feel like the solution should be straightforward, but I can't figure out how to get it.

Here's my code:

library(rvest)

BattlestylesURL <- "https://bulbapedia.bulbagarden.net/wiki/Battle_Styles_(TCG)"

# Collect every table node on the page
temp <- BattlestylesURL %>% 
  read_html() %>%
  html_nodes("table")

# Parse the 16th table, which is the one I'm after
html_table(temp[16], fill = TRUE)

I think the biggest headache is that some columns combine images and text and I'm trying to have a dataframe with information from both in the same column. For example, the "Card Name" of row 6 is Kricketune V. 'Kricketune' is text, but the "V" is a picture.

I feel like there should be a simple way of doing it but I can't seem to wrap my head around it. Would greatly appreciate help!

The closest example I've found is "Scraping Wikipedia HTML table with images, text, and blank cells with R". However, I couldn't figure out how to apply it to this situation, because I'm also trying to keep the text that is already in the cell.

1 Answer


You could grab the table first, then update those columns. For the Type column you can use ifelse, because the value you want is either the text of the th itself or, where present, the alt attribute of its child img. The interesting bit is choosing CSS selectors that match only the relevant nodes, so that the values you extract can be written straight back into the table.

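library(rvest)
library(tidyverse)

page <- read_html("https://bulbapedia.bulbagarden.net/wiki/Battle_Styles_(TCG)#Set_lists")

# Grab the set list table as text (cells that only hold images come back empty)
df <- page %>%
  html_node(".multicol .roundy tr:nth-child(2) table") %>%
  html_table(fill = TRUE)

# Drop rows without a card number, and drop the Image column
df <- subset(df, No. != "") %>% select(-c(Image))

# Rarity is held in the title attribute of the link in the fifth cell
df$Rarity <- page %>%
  html_nodes("td:nth-child(1) tr table td:nth-child(5) a") %>%
  html_attr("title")

# Type is the alt text of a child img where present, otherwise the cell's own text.
# map_chr returns a character vector rather than a list column.
df$Type <- map_chr(page %>% html_nodes("td:nth-child(1) tr tr:nth-child(n+2) th:nth-child(4)"), function(x) {
  var_node <- x %>%
    html_node("img") %>%
    html_attr("alt")
  ifelse(is.na(var_node), x %>% html_text(trim = TRUE), var_node)
})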
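
The Card name column from the question (text plus an image, as in Kricketune V) could probably be handled in a similar way. The following is only a sketch on made-up markup, not the real Bulbapedia HTML: the two-row table, the `card_name` variable, and the assumption that the image always follows the text are all hypothetical, and you would still need to work out the right selector for the real page.

```r
library(rvest)

# Hypothetical fragment: one name is plain text, the other ends with an image.
doc <- read_html('<table>
  <tr><td><a>Bellsprout</a></td></tr>
  <tr><td><a>Kricketune</a><img alt="V" src="v.png"></td></tr>
</table>')

cells <- html_nodes(doc, "td")

card_name <- vapply(cells, function(x) {
  txt  <- html_text(x, trim = TRUE)                     # visible text of the cell
  alts <- x %>% html_nodes("img") %>% html_attr("alt")  # alt text of any images
  alts <- alts[!is.na(alts)]                            # ignore images without alt
  trimws(paste(c(txt, alts), collapse = " "))           # text first, then alt text
}, character(1))

print(card_name)
```

Using html_nodes (plural) here means a cell without any img simply contributes no alt text, so no ifelse is needed for this column.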
QHarr
  • Thanks very much QHarr, it pretty much answered it! Really appreciate it. I'm still a bit confused about how to use the ifelse for the type column but will have a bit more of an investigation. – Daniel Dunn May 18 '21 at 00:17
  • The ifelse is there because the type is not always held in the alt attribute of a child img. In a few cases it is simply the text of the th element itself. If you run x %>% html_node("img") %>% html_attr("alt") where no img exists, it returns NA, which you can then handle with the ifelse to return the cell's html_text instead. – QHarr May 18 '21 at 00:57
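
The NA fallback described in the last comment can be tried in isolation, without hitting the site. This is a minimal sketch on a made-up fragment: the two th cells and the `type` variable are illustrative assumptions, not the page's actual markup.

```r
library(rvest)

# Hypothetical fragment: one Type cell holds an image, the other plain text.
doc <- read_html('<table>
  <tr><th><img alt="Grass" src="grass.png"></th></tr>
  <tr><th> Colorless </th></tr>
</table>')

cells <- html_nodes(doc, "th")

type <- vapply(cells, function(x) {
  # html_node returns a missing node when there is no img,
  # and html_attr on that missing node gives NA
  alt <- x %>% html_node("img") %>% html_attr("alt")
  if (is.na(alt)) html_text(x, trim = TRUE) else alt
}, character(1))

print(type)
```

Because each cell is handled one at a time, a plain if/else does the same job as ifelse here. ifelse only becomes necessary if you vectorise the check over all cells at once.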