note: I haven't asked a question here before, and am still not sure how to make this legible, so let me know of any confusion or tips on making this more readable

I'm trying to download user information from the 2004/06 to 2004/09 Internet Archive captures of makeoutclub.com (a wacky, now-defunct social network targeted toward alternative music fans, created around 2000, making it one of the oldest profile-based social networks on the Internet) using R,* specifically the rcrawler package. So far, I've been able to use the package to get the usernames and profile links into a dataframe, using XPath to identify the elements I want, but somehow it doesn't work for either the location or interests sections of the profiles, both of which are plain text rather than other elements in the HTML. For an idea of the site/data I'm talking about, here's the page I've been testing my XPath on: https://web.archive.org/web/20040805155243/http://www.makeoutclub.com/03/profile/html/boys/2.html

I have been testing my XPath expressions using rcrawler's ContentScraper function, which extracts the set of elements matching the specified XPath from one specific page of the site you want to crawl. Here is my working expression that identifies the usernames and links on the site, with the specific page I'm using filled in; it returns a vector:

testwaybacktable <- ContentScraper(
  Url = "https://web.archive.org/web/20040805155243/http://www.makeoutclub.com/03/profile/html/boys/2.html",
  XpathPatterns = c("//tr[1]/td/font/a[1]/@href", "//tr[1]/td/font/a[1]"),
  ManyPerPattern = TRUE
)

And here is the failing one, where I'm testing the "location" field; it returns an empty vector:

testwaybacklocations <- ContentScraper(
  Url = "https://web.archive.org/web/20040805155243/http://www.makeoutclub.com/03/profile/html/boys/2.html",
  XpathPatterns = "//td/table/tbody/tr[1]/td/font/text()[2]",
  ManyPerPattern = TRUE
)

And the other failing one, this time looking for the text under "interests":

testwaybackint <- ContentScraper(
  Url = "https://web.archive.org/web/20040805155243/http://www.makeoutclub.com/03/profile/html/boys/2.html",
  XpathPatterns = "//td/table/tbody/tr[2]/td/font/text()",
  ManyPerPattern = TRUE
)

The XPath expressions I'm using here seem to select the right elements when I search for them in Chrome's Inspect panel, but the program doesn't seem to read them. I also tried selecting only a single element for each field, and it still produced an empty vector. I know this tool can read text on this webpage (I tested it on another random piece of text), but somehow I'm getting nothing when I run these tests. Is there something wrong with my XPath expressions? Should I be using different tools to do this? Thanks for your patience!
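One common reason an XPath that matches in Chrome's inspector comes back empty from an HTML parser is the tbody element: browsers insert <tbody> into every table when building the DOM, but old hand-written HTML (like this archived page may well be) often doesn't contain it, so a path copied from the inspector that includes tbody matches nothing in the raw document. A minimal sketch of the effect, using xml2 on hypothetical markup (not the actual makeoutclub.com HTML):

```r
library(xml2)

# Hypothetical table markup WITHOUT <tbody>, as old hand-written HTML often is;
# the real page's structure may differ.
html <- "<table><tr><td><font>username<br>location</font></td></tr></table>"
doc <- read_html(html)

# An inspector-derived path that includes tbody finds nothing in the raw HTML ...
length(xml_find_all(doc, "//table/tbody/tr/td/font/text()"))

# ... while the same path without tbody matches the text nodes:
xml_text(xml_find_all(doc, "//table/tr/td/font/text()"))
```

If dropping tbody from the failing patterns makes them match, this is the cause.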

*This is for a digital humanities project that will hopefully use some NLP to analyze language around gender and sexuality in particular, in dialogue with NLP analysis of the lyrics of the most popular bands on the site.

  • I am not clear what exactly you are trying to scrape but you can look at `rvest` package. It is a very handy package for scrapping. – Ronak Shah Feb 03 '20 at 02:15
  • I'm pretty sure that scraping from the Wayback Machine is against the TOS, i.e. illegal. SO's How To Ask advises against asking for assistance with illegal actions. – IRTFM Feb 03 '20 at 05:10
  • I looked that up and can't find it anywhere in the TOS, and I've seen some open source tools on Github specifically for crawling the Wayback archive, so I'm pretty sure the Wayback Machine is fine with it. – Ella Markianos Feb 05 '20 at 00:22

1 Answer

A late answer, but maybe it will help nonetheless. I'm also not sure about the whole TOS question, but I think that's yours to figure out. Long story short, I will just try to address the technical aspects of your problem ;)

I am not familiar with the rcrawler package. I usually use rvest for web scraping, and I think it is a good choice. To achieve the desired output you would have to use something like

# parameters
url <- your_url
xpath_pattern <- your_pattern
# parse the page
wp <- xml2::read_html(url)
# extract the matching nodes
res <- rvest::html_nodes(wp, xpath = xpath_pattern)
# get their text content
res <- rvest::html_text(res)

I don't think it is possible to pass a vector with multiple patterns as the xpath argument, but you can run html_nodes separately for each pattern you want to extract.
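Running the extraction once per pattern can be sketched like this, e.g. with lapply over a named vector of patterns (the URL and patterns below are taken from the question; adapt as needed):

```r
library(xml2)
library(rvest)

url <- "https://web.archive.org/web/20040805155243/http://www.makeoutclub.com/03/profile/html/boys/2.html"
patterns <- c(links = "//tr[1]/td/font/a[1]/@href",
              names = "//tr[1]/td/font/a[1]")

# parse the page once, then run each pattern separately
wp <- read_html(url)
res <- lapply(patterns, function(p) html_text(html_nodes(wp, xpath = p)))
# res$links and res$names now hold the matches for each pattern
```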

I think the first two patterns should work this way. The pattern in your last call seems to be wrong somehow. If you want to extract the text inside the tables, it should probably be something like "//tr[2]/td/font/text()[2]".
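Put together, a sketch for the interests text, assuming the corrected pattern above actually matches the page's structure; old table layouts tend to leave a lot of surrounding whitespace, so trimming the result is usually worthwhile:

```r
library(xml2)
library(rvest)

url <- "https://web.archive.org/web/20040805155243/http://www.makeoutclub.com/03/profile/html/boys/2.html"
wp <- read_html(url)

# extract the text nodes with the suggested pattern and strip stray whitespace
interests <- html_nodes(wp, xpath = "//tr[2]/td/font/text()[2]")
trimws(html_text(interests))
```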

sambold