I would like to crawl the poems from this link and save them as txt files. Here is what I need:

  1. create a folder named after each poet,
  2. save each poem in txt format by following the poem links in the red circle one by one,
  3. name each file after the poem title, with the extension txt.

[Screenshot: poem links circled in red]

I'm new to web crawling with R. Could someone help? I'd appreciate any suggestions.

Code:

library(Rcrawler)
library(rvest)

Rcrawler(Website = 'http://famouspoetsandpoems.com/top_poems.html', no_cores = 4, no_conn = 4, Obeyrobots = TRUE)

page <- LinkExtractor(url = 'http://famouspoetsandpoems.com/top_poems.html', ExternalLInks=TRUE)

page$InternalLinks

Out:

  [1] "http://famouspoetsandpoems.com/"                                      
  [2] "http://famouspoetsandpoems.com/poets.html"                            
  [3] "http://famouspoetsandpoems.com/month_poem.html"                       
  [4] "http://famouspoetsandpoems.com/month_poet.html"                       
  [5] "http://famouspoetsandpoems.com/top_poems.html"                        
  [6] "http://famouspoetsandpoems.com/poets_quotes.html"                     
  [7] "http://famouspoetsandpoems.com/love_poems.html"                       
  [8] "http://famouspoetsandpoems.com/thematic_poems.html"                   
  [9] "http://famouspoetsandpoems.com/thematic_quotes.html"                  
 [10] "http://famouspoetsandpoems.com/thematic_poems/birthday_poems.html"    
 [11] "http://famouspoetsandpoems.com/thematic_poems/death_poems.html"       
 [12] "http://famouspoetsandpoems.com/thematic_poems/mother_poems.html"      
 [13] "http://famouspoetsandpoems.com/thematic_poems/family_poems.html"      
 [14] "http://famouspoetsandpoems.com/thematic_poems/thank_you_poems.html"   
 [15] "http://famouspoetsandpoems.com/thematic_poems/sympathy_poems.html"    
 [16] "http://famouspoetsandpoems.com/thematic_poems/retirement_poems.html"  
 [17] "http://famouspoetsandpoems.com/thematic_poems/sorry_poems.html"       
 [18] "http://famouspoetsandpoems.com/thematic_poems/angel_poems.html"       
 [19] "http://famouspoetsandpoems.com/thematic_poems/relationship_poems.html"
 [20] "http://famouspoetsandpoems.com/poets/langston_hughes"                 
 [21] "http://famouspoetsandpoems.com/poets/shel_silverstein"                
 [22] "http://famouspoetsandpoems.com/poets/pablo_neruda"                    
 [23] "http://famouspoetsandpoems.com/poets/maya_angelou"                    
 [24] "http://famouspoetsandpoems.com/poets/edgar_allan_poe"                 
 [25] "http://famouspoetsandpoems.com/poets/robert_frost"                    
 [26] "http://famouspoetsandpoems.com/poets/emily_dickinson"                 
 [27] "http://famouspoetsandpoems.com/poets/elizabeth_barrett_browning"      
 [28] "http://famouspoetsandpoems.com/poets/e__e__cummings"                  
 [29] "http://famouspoetsandpoems.com/poets/walt_whitman"                    
 [30] "http://famouspoetsandpoems.com/poets/william_wordsworth"              
 [31] "http://famouspoetsandpoems.com/poets/allen_ginsberg"                  
 [32] "http://famouspoetsandpoems.com/poets/sylvia_plath"                    
 [33] "http://famouspoetsandpoems.com/poets/jack_prelutsky"                  
 [34] "http://famouspoetsandpoems.com/poets/william_butler_yeats"            
 [35] "http://famouspoetsandpoems.com/poets/thomas_hardy"                    
 [36] "http://famouspoetsandpoems.com/poets/robert_hayden"                   
 [37] "http://famouspoetsandpoems.com/poets/amy_lowell"                      
 [38] "http://famouspoetsandpoems.com/poets/oscar_wilde"                     
 [39] "http://famouspoetsandpoems.com/poets/theodore_roethke"                
 [40] "http://famouspoetsandpoems.com/poets_by_nationality.html"             
 [41] "http://famouspoetsandpoems.com/poets_african_american.html"           
 [42] "http://famouspoetsandpoems.com/poets_women.html"                      
 [43] "http://famouspoetsandpoems.com/poets_contemporary.html"               
 [44] "http://famouspoetsandpoems.com/poets_nobel_prize.html"                
 [45] "http://famouspoetsandpoems.com/country/America/American_poets.html"   
 [46] "http://famouspoetsandpoems.com/country/England/English_poets.html"    
 [47] "http://famouspoetsandpoems.com/poets/maya_angelou/poems/492"          
 [48] "http://famouspoetsandpoems.com/poets/shel_silverstein/poems/14836"    
 [49] "http://famouspoetsandpoems.com/poets/pablo_neruda/poems/15705"        
 [50] "http://famouspoetsandpoems.com/poets/e__e__cummings/poems/14130"      
 [51] "http://famouspoetsandpoems.com/poets/robert_frost/poems/528"          
 [52] "http://famouspoetsandpoems.com/poets/edgar_allan_poe/poems/18847"     
 [53] "http://famouspoetsandpoems.com/poets/emily_dickinson/poems/5212"      
 [54] "http://famouspoetsandpoems.com/poets/langston_hughes/poems/16946"     
 [55] "http://famouspoetsandpoems.com/poets/ezra_pound/poems/18774"          
 [56] "http://famouspoetsandpoems.com/poets/ezra_pound"                      
 [57] "http://famouspoetsandpoems.com/poets/shel_silverstein/poems/14818"    
 [58] "http://famouspoetsandpoems.com/poets/oscar_wilde/poems/11040"         
 [59] "http://famouspoetsandpoems.com/poets/maya_angelou/poems/482"          
 [60] "http://famouspoetsandpoems.com/poets/langston_hughes/poems/16944"     
 [61] "http://famouspoetsandpoems.com/poets/walt_whitman/poems/17543"        
 [62] "http://famouspoetsandpoems.com/poets/robert_frost/poems/530"          
 [63] "http://famouspoetsandpoems.com/poets/william_wordsworth/poems/10951"  
 [64] "http://famouspoetsandpoems.com/poets/mark_strand/poems/11833"         
 [65] "http://famouspoetsandpoems.com/poets/mark_strand"                     
 [66] "http://famouspoetsandpoems.com/poets/w__h__auden/poems/10095"         
 [67] "http://famouspoetsandpoems.com/poets/w__h__auden"                     
 [68] "http://famouspoetsandpoems.com/poets/maya_angelou/poems/496"          
 [69] "http://famouspoetsandpoems.com/poets/edgar_allan_poe/poems/18848"     
 [70] "http://famouspoetsandpoems.com/poets/dylan_thomas/poems/11395"        
 [71] "http://famouspoetsandpoems.com/poets/dylan_thomas"                    
 [72] "http://famouspoetsandpoems.com/poets/ogden_nash/poems/19570"          
 [73] "http://famouspoetsandpoems.com/poets/ogden_nash"                      
 [74] "http://famouspoetsandpoems.com/poets/shel_silverstein/poems/14820"    
 [75] "http://famouspoetsandpoems.com/poets/emily_dickinson/poems/6104"      
 [76] "http://famouspoetsandpoems.com/poets/edgar_allan_poe/poems/18849"     
 [77] "http://famouspoetsandpoems.com/poets/e__e__cummings/poems/14135"      
 [78] "http://famouspoetsandpoems.com/poets/anna_akhmatova/poems/31"         
 [79] "http://famouspoetsandpoems.com/poets/anna_akhmatova"                  
 [80] "http://famouspoetsandpoems.com/poets/pablo_neruda/poems/15708"        
 [81] "http://famouspoetsandpoems.com/poets/seamus_heaney/poems/12699"       
 [82] "http://famouspoetsandpoems.com/poets/seamus_heaney"                   
 [83] "http://famouspoetsandpoems.com/poets/william_butler_yeats/poems/10173"
 [84] "http://famouspoetsandpoems.com/poets/william_barnes/poems/20551"      
 [85] "http://famouspoetsandpoems.com/poets/william_barnes"                  
 [86] "http://famouspoetsandpoems.com/poets/ted_kooser/poems/17900"          
 [87] "http://famouspoetsandpoems.com/poets/ted_kooser"                      
 [88] "http://famouspoetsandpoems.com/poets/gwendolyn_brooks/poems/4176"     
 [89] "http://famouspoetsandpoems.com/poets/gwendolyn_brooks"                
 [90] "http://famouspoetsandpoems.com/poets/sylvia_plath/poems/18897"        
 [91] "http://famouspoetsandpoems.com/poets/jack_prelutsky/poems/18767"      
 [92] "http://famouspoetsandpoems.com/poets/sara_teasdale/poems/17949"       
 [93] "http://famouspoetsandpoems.com/poets/sara_teasdale"                   
 [94] "http://famouspoetsandpoems.com/poets/charles_bukowski/poems/13062"    
 [95] "http://famouspoetsandpoems.com/poets/charles_bukowski"                
 [96] "http://famouspoetsandpoems.com/poets/allen_ginsberg/poems/8318"       
 [97] "http://famouspoetsandpoems.com/poets/robert_hayden/poems/4406"        
 [98] "http://famouspoetsandpoems.com/poets/william_shakespeare/poems/1317"  
 [99] "http://famouspoetsandpoems.com/poets/william_shakespeare"             
[100] "http://famouspoetsandpoems.com/poets/william_blake/poems/1002"        
[101] "http://famouspoetsandpoems.com/poets/william_blake"                   
[102] "http://famouspoetsandpoems.com/poets/sylvia_plath/poems/18899"        
[103] "http://famouspoetsandpoems.com/poets/jack_prelutsky/poems/18768"      
[104] "http://famouspoetsandpoems.com/poets/walt_whitman/poems/17466"        
[105] "http://famouspoetsandpoems.com/poets/robert_burns/poems/4971"         
[106] "http://famouspoetsandpoems.com/poets/robert_burns"                    
[107] "http://famouspoetsandpoems.com/poets/maya_angelou/poems/494"          
[108] "http://famouspoetsandpoems.com/poets/stephen_crane/poems/13266"       
[109] "http://famouspoetsandpoems.com/poets/stephen_crane"                   
[110] "http://famouspoetsandpoems.com/poets/raymond_carver/poems/4592"       
[111] "http://famouspoetsandpoems.com/poets/raymond_carver"                  
[112] "http://famouspoetsandpoems.com/poets/e__e__cummings/poems/14131"      
[113] "http://famouspoetsandpoems.com/poets/langston_hughes/poems/16947"     
[114] "http://famouspoetsandpoems.com/about_project.html"                    
[115] "http://famouspoetsandpoems.com/privacy_policy.html"                   
[116] "http://famouspoetsandpoems.com/copyright_notice.html"                 
[117] "http://famouspoetsandpoems.com/links_poetry.html"                     
[118] "http://famouspoetsandpoems.com/link_to_us.html"                       
[119] "http://famouspoetsandpoems.com/tell_a_friend.html"                    
[120] "http://famouspoetsandpoems.com/contact_us.html"
ah bon
  • What have you tried so far? – xwhitelight Jan 08 '21 at 15:47
  • I updated my code, please check, thanks @xwhitelight – ah bon Jan 09 '21 at 01:29
  • You just extracted all the links there. Do you understand the structure of HTML? The basis of web scraping is to understand the HTML structure and select the exact elements you want to scrape. – xwhitelight Jan 09 '21 at 16:33
  • These tasks are pretty easy for me, but I don't think it would be helpful to just give you the code. That's not what this community is about. First, I suggest you learn HTML structure if you don't have the basic knowledge, then read some docs about `rvest`. After that, try to follow these hints: – xwhitelight Jan 09 '21 at 16:39
  • 1. The data about poets and poems is in a table: poems are in the second column and poets are in the third. First, find the right table (there may be more than one table on the page). Then extract the second column to get the poems and the third to get the poets => take the unique poet names to create directories. 2. Get the href attribute of the poem nodes to get the links to the poems' content => use read_html() to read those links and find a way to extract the content. – xwhitelight Jan 09 '21 at 16:44
  • Thanks a lot for sharing so many hints and suggestions in detail. – ah bon Jan 09 '21 at 16:46
  • Try first and come back later if you can't get it to work. – xwhitelight Jan 09 '21 at 16:49
  • OK, it will take me some time. I'll try and let you know if I get somewhere. – ah bon Jan 09 '21 at 16:51

1 Answer

This requires connecting quite a few pieces of knowledge, which I don't think a beginner can easily do. So here is the code, with explanations in the comments:

library(rvest)
library(dplyr)

pg <- read_html("http://famouspoetsandpoems.com/top_poems.html")

tbl <- pg %>% 
  html_nodes(xpath = "//table[@width='436']") %>% .[[2]] %>% # the table with the poems and poets is the second one whose width equals 436
  html_table(fill = TRUE) %>% # there are blank rows between the poems' rows => need to set fill = TRUE
  setNames(c("top", "poem", "poet")) %>%
  filter(!is.na(top)) %>% # remove the blank rows
  mutate(
    link = sapply(poem, function(x) {
      paste0(
        "http://famouspoetsandpoems.com",
        pg %>% html_node(xpath = paste0("//td/a[contains(., \"", x, "\")]")) %>% html_attr("href")
      ) # the tricky part: for each poem title, find the <a> tag whose text contains the title and extract its href attribute
    }, USE.NAMES = FALSE)
  )

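dir <- "~/poems" # where to save the result
# recursive = TRUE also creates ~/poems itself if it does not exist yet
for (poet in unique(tbl$poet)) dir.create(file.path(dir, poet), recursive = TRUE, showWarnings = FALSE)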

for (i in seq_len(nrow(tbl))) {
  poem_content <- 
    read_html(tbl$link[i]) %>% # read the poem page
    html_nodes(xpath = "//td/div[@style='padding-left:14px;padding-top:20px;font-family:Arial;font-size:13px;']/text()") %>%
    html_text(trim = TRUE) # the poem lines, one element per line
  file_path <- file.path(dir, tbl$poet[i], paste0(tbl$poem[i], ".txt")) # <dir>/<poet>/<poem title>.txt
  writeLines(poem_content, con = file_path)
}
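
One caveat about the loop above: poet names and poem titles are used directly as folder and file names, so a title containing characters such as `/`, `?` or `:` may produce an invalid path on some systems. Below is a minimal sketch of one way to guard against that. The helper `safe_name()` is hypothetical (it is not part of rvest or of the code above), and the poets and titles are made-up placeholders written to a temporary directory, not data from the site.

```r
# Hypothetical helper: replace characters that are commonly rejected
# in file names (mostly by Windows, "/" also by Unix) with underscores.
safe_name <- function(x) {
  x <- gsub('[\\\\/:*?"<>|]', "_", x)
  trimws(x)
}

# Made-up rows standing in for the poet and poem columns of the scraped table
poems <- data.frame(
  poet = c("Poet A", "Poet B"),
  poem = c("Plain Title", "And/Or: A Made-Up Title?"),
  stringsAsFactors = FALSE
)

# Write into a temporary directory so the sketch does not touch ~/poems
out_dir <- file.path(tempdir(), "poems")

for (i in seq_len(nrow(poems))) {
  poet_dir <- file.path(out_dir, safe_name(poems$poet[i]))
  dir.create(poet_dir, recursive = TRUE, showWarnings = FALSE)
  file_path <- file.path(poet_dir, paste0(safe_name(poems$poem[i]), ".txt"))
  writeLines("placeholder text", con = file_path)
}

# Relative paths of the files that were written
written <- list.files(out_dir, recursive = TRUE)
print(written)
```

To use the same idea in the scraping loop, the poet and title would be wrapped in `safe_name()` both when creating the folders and when building `file_path`, so that the two always agree. The trade-off is that the file name no longer matches the poem title exactly when the title contains one of the replaced characters.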
xwhitelight