Extract data from multiple links with rvest and purrr

Question

I have a data frame containing a list of links, and I'd like to run a function on each link to extract data from it.

Libraries and data:

library(rvest)
library(tidyverse)

link_df <- tribble(~title, ~episode, ~link,
        "a", "1", "https://www.backlisted.fm/episodes/1-j-l-carr-a-month-in-the-country",
        "b", "2", "https://www.backlisted.fm/episodes/2-jean-rhys-good-morning-midnight",
        "c", "3", "https://www.backlisted.fm/episodes/3-david-nobbs-1")

I've tried to take pieces from this and this answer but am missing a step somewhere:

recs_extract <- function(df){
  
  # `df` is a character vector of URLs; read each page
  pages <- df %>% map(read_html)
  
  data <- pages %>% 
    map_dfr(. %>% 
              html_nodes(css = "ul li") %>% 
              html_text() %>% 
              tibble(title = .) %>% 
              # `12:n()-2` parses as `(12:n()) - 2`, i.e. rows 10 to n() - 2
              slice(12:n()-2) %>% 
              separate(col = title,
                       into = c("author", "titles"),
                       sep = "-" ) %>% 
              separate(titles, 
                       into = c(paste("books", 1:15)),
                       sep = ",", 
                       extra = "drop") %>% 
              mutate(across(where(is.character), str_trim)) %>% 
              janitor::remove_empty(which = "cols") %>% 
              pivot_longer(cols = contains("books"),
                           names_to = NULL, 
                           values_to = "Title", 
                           values_drop_na = TRUE)
            )
  
  # return the combined tibble visibly
  data
}

This function works with one link:

link_df$link[1] %>% map(recs_extract)

[[1]]
# A tibble: 15 x 2
   author                       Title                                   
   <chr>                        <chr>                                   
 1 J L Carr                     A Month in the Country                  
 2 J L Carr                     Harpole and Foxberrow General Publishers
 3 J L Carr                     The Battle of Pollocks Crossing         
 4 Vasily Grossman              Life and Fate                           
 5 Mr Bingo                     Hate Mail                               
 6 William S Burroughs          Naked Lunch                             
 7 Nancy Mitford                Love in a Cold Climate                  
 8 J Arthur Gibbs               A Cotswold Village                      
 9 Giuseppe Tomasi di Lampedusa The Leopard                             
10 W N P Barbellion             Journal of a Disappointed Man           
11 Lissa Evans                  Their Finest Hour and a Half            
12 Lissa Evans                  Crooked Heart                           
13 Byron Rogers                 The Last Englishman                     
14 Andy Miller                  Tilting at Windmills                    
15 William Golding              Darkness Visible

Do I need to nest the data frame first? How can I run the function across each link and store the results?

#doesn't work
link_df %>% 
  group_by(title) %>%
  nest() %>% 
  mutate(data = map(data, recs_extract, link))

Thank you, apologies for the long post.

score 0 · Answer 1 · answered Aug 01 '20 at 03:15

You could use map() like this:

library(dplyr)
library(purrr)

link_df %>% mutate(data = map(link, recs_extract))


# A tibble: 3 x 4
#  title episode link                                                                 data             
#  <chr> <chr>   <chr>                                                                <list>           
#1 a     1       https://www.backlisted.fm/episodes/1-j-l-carr-a-month-in-the-country <tibble [15 × 2]>
#2 b     2       https://www.backlisted.fm/episodes/2-jean-rhys-good-morning-midnight <tibble [17 × 2]>
#3 c     3       https://www.backlisted.fm/episodes/3-david-nobbs-1                   <tibble [18 × 2]>
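
If one long table is more convenient than a list column, the nested result can be flattened with tidyr::unnest(). The following is only a minimal sketch: fake_extract() and the example.com links are hypothetical stand-ins for recs_extract() and the real links, used so the example runs without network access.

```r
library(dplyr)
library(purrr)
library(tidyr)

# Hypothetical stand-in for recs_extract(): returns a small tibble per link
# so the example runs offline.
fake_extract <- function(link) {
  tibble(author = c("Author A", "Author B"),
         Title  = paste("Book from episode", basename(link), 1:2))
}

# Placeholder links (assumption, not the real episode pages)
link_df <- tribble(~title, ~episode, ~link,
                   "a", "1", "https://example.com/episodes/1",
                   "b", "2", "https://example.com/episodes/2")

result <- link_df %>%
  mutate(data = map(link, fake_extract)) %>%  # one tibble per row
  unnest(data)                                # expand to one row per book

result
```

Each row of the original data frame is repeated once per extracted book, so title, episode and link stay attached to every author/Title pair.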

Ronak Shah

Okay, this is what I tried first with the real data, and it throws this error: ``` Error: Problem with `mutate()` input `data`. x `cols` must select at least one column. ℹ Input `data` is `map(link, recs_extract)`. ``` – Corey Pembleton Aug 01 '20 at 12:17
I am guessing those links do not have the table that you want to extract. Can you open one link manually and check whether the information you are trying to extract is present? Can you also create a reproducible example for that? This works fine on the data you have shared. – Ronak Shah Aug 01 '20 at 12:25
I'm going to check the links; there may be a few that are broken. – Corey Pembleton Aug 01 '20 at 16:10
I believe that some of the text is not stored under the ul li pattern. I will create a regex and a new question for it @ronak-shah – Corey Pembleton Aug 01 '20 at 20:43
I have added a new question here https://stackoverflow.com/questions/63210126/function-to-extract-text-using-rvest-from-multiple-types-of-text-lists – Corey Pembleton Aug 01 '20 at 21:12
Looks like the question is deleted now? – Ronak Shah Aug 02 '20 at 00:49
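
The "`cols` must select at least one column" error quoted in the comments is consistent with pivot_longer(cols = contains("books")) finding no matching columns, which could happen when a page yields no usable "ul li" items. One way to keep a single bad link from stopping the whole run is to wrap the extractor in purrr::possibly(), which returns a fallback value instead of raising the error. This is a minimal sketch under assumptions: toy_extract() and the example.com links are hypothetical and only mimic a failing page.

```r
library(dplyr)
library(purrr)

# Hypothetical extractor: errors for links containing "bad",
# mimicking a page without the expected list structure.
toy_extract <- function(link) {
  if (grepl("bad", link)) stop("`cols` must select at least one column.")
  tibble(author = "Author A", Title = "Some Book")
}

# possibly() returns `otherwise` instead of raising the error
safe_extract <- possibly(toy_extract, otherwise = NULL)

# Placeholder links (assumption, not the real episode pages)
links <- tibble(link = c("https://example.com/good-1",
                         "https://example.com/bad-2",
                         "https://example.com/good-3"))

result <- links %>%
  mutate(data = map(link, safe_extract),
         ok   = !map_lgl(data, is.null))   # flag the links that failed

filter(result, !ok)   # links to inspect by hand
```

With the real function, possibly(recs_extract, otherwise = NULL) would take the place of safe_extract, and the rows where ok is FALSE point to the pages whose markup needs a different selector.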
