I'm working on a web crawling project where I'd like to start at a main url: https://law.justia.com/codes/
I'd like to ultimately end up with a list of urls for pages that contain actual state code text. For example, if you go to the webpage above, you can navigate to
Montana > 2021 Montana Code > Title 1 > General Provisions > Part 1 > 1-1-101 >
and then you land on a page that does not contain any further links to statute sections and instead has actual statute text. I'd like to collect the url for this page, as well as the urls of all the other terminal pages like it.
I've started with the following code:
library(rvest)    # for inspecting page content later
library(purrr)    # for flattening the nested results
library(Rcrawler) # provides LinkExtractor()

# Extract every link on the starting page
page <- LinkExtractor(url = "https://law.justia.com/codes/")
page$InternalLinks

# As a test, extract the links from the first 9 internal urls
new_links <- vector("list", 9)
for (i in 1:9) {
  new_links[[i]] <- LinkExtractor(url = page$InternalLinks[[i]])
}
This results in new_links, a list holding the results for the first 9 urls (as a test I started with just 9), where each element is the three-part list LinkExtractor returns (Info, InternalLinks, ExternalLinks). So: 9 lists of three lists.
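If it helps, I think I can at least flatten what I have so far into a single deduplicated character vector by pulling just the InternalLinks element out of each result with purrr:

# Collapse the 9 results into one flat, deduplicated vector of urls
found_urls <- unique(unlist(map(new_links, "InternalLinks")))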
And that's where I'm at. I'm not sure where to go from here; I'm assuming it will involve a loop of some kind, but I'm struggling to write something that doesn't result in a list of lists of lists of lists...
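Here is roughly the shape of loop I have in mind, keeping flat character vectors (a queue of urls to visit and a visited set) instead of nesting lists. This is only a sketch; the /codes/ filter and the max_pages cap are assumptions I've added to keep a test run small and on-site:

# Sketch: breadth-first crawl using flat vectors instead of nested lists
queue     <- "https://law.justia.com/codes/"
visited   <- character(0)
max_pages <- 50  # assumed safety cap while testing

while (length(queue) > 0 && length(visited) < max_pages) {
  current <- queue[1]
  queue   <- queue[-1]
  if (current %in% visited) next
  visited <- c(visited, current)

  out   <- LinkExtractor(url = current)
  links <- unlist(out$InternalLinks)
  # Assumption: only follow links that stay inside the codes section
  links <- links[grepl("/codes/", links, fixed = TRUE)]
  queue <- unique(c(queue, setdiff(links, visited)))
}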
I'm also not sure yet how I will differentiate the terminal urls from the urls that still need to be crawled for further links.
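The only idea I've had for that so far is a content-based check: fetch the page with rvest and test whether any of its /codes/ links point deeper into the site than the page itself, on the theory that listing pages link down to their children while statute pages don't. The helper below is purely a guess at a heuristic, not anything I know about justia's markup:

# Guessed heuristic: a page is "terminal" if none of its /codes/ links
# are deeper (more path segments) than the page's own url
path_depth <- function(u) {
  length(strsplit(sub("^https?://[^/]+", "", u), "/")[[1]])
}

is_terminal <- function(url) {
  page  <- read_html(url)
  hrefs <- html_attr(html_elements(page, "a"), "href")
  hrefs <- hrefs[!is.na(hrefs) & grepl("/codes/", hrefs, fixed = TRUE)]
  all(vapply(hrefs, path_depth, integer(1)) <= path_depth(url))
}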