Note: I haven't asked a question here before and am still not sure how to make this legible, so please let me know about anything confusing or any tips for making this more readable.
I'm trying to download user information from the 2004/06 to 2004/09 Internet Archive captures of makeoutclub.com (a wacky, now-defunct social network targeted toward alternative music fans, which was created in ~2000, making it one of the oldest profile-based social networks on the Internet) using R,* specifically the Rcrawler package. So far, I've been able to use the package to get the usernames and profile links into a dataframe, using XPath to identify the elements I want, but somehow it doesn't work for either the location or the interests section of the profiles, both of which are plain text rather than separate elements in the HTML. For an idea of the site/data I'm talking about, here's the page I've been testing my XPath on: https://web.archive.org/web/20040805155243/http://www.makeoutclub.com/03/profile/html/boys/2.html
I have been testing my XPath expressions with Rcrawler's ContentScraper function, which extracts the set of elements matching the specified XPath from one specific page of the site you want to crawl. Here is my working expression, which identifies the usernames and links on the page I'm testing against and returns a vector:
testwaybacktable <- ContentScraper(Url = "https://web.archive.org/web/20040805155243/http://www.makeoutclub.com/03/profile/html/boys/2.html", XpathPatterns = c("//tr[1]/td/font/a[1]/@href", "//tr[1]/td/font/a[1]"), ManyPerPattern = TRUE)
And here is the first one that fails, where I'm testing the "location" field; it returns an empty vector:
testwaybacklocations <- ContentScraper(Url = "https://web.archive.org/web/20040805155243/http://www.makeoutclub.com/03/profile/html/boys/2.html", XpathPatterns = "//td/table/tbody/tr[1]/td/font/text()[2]", ManyPerPattern = TRUE)
And the other one that fails, this one looking for the text under "interests":
testwaybackint <- ContentScraper(Url = "https://web.archive.org/web/20040805155243/http://www.makeoutclub.com/03/profile/html/boys/2.html", XpathPatterns = "//td/table/tbody/tr[2]/td/font/text()", ManyPerPattern = TRUE)
The XPath expressions I'm using here seem to select the right nodes when I search for them in Chrome's Inspect tool (DevTools), but ContentScraper doesn't seem to match them. I have also tried selecting only one element for each field, and it still produced an empty vector. I know that this tool can read text on this webpage (I tested another random piece of text), but somehow I'm getting nothing when I run these two tests. Is there something wrong with my XPath expressions? Should I be using different tools to do this? Thanks for your patience!
*This is for a digital humanities project that will hopefully use some NLP to analyze the profiles' language, especially around gender and sexuality, in dialogue with some NLP analysis of the lyrics of the most popular bands on the site.
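
One possible cause I could rule out, sketched here under assumptions: browsers add tbody elements to tables in the DOM that Chrome's inspector shows, even when the raw HTML contains none, so an XPath with /tbody/ in it can match in the inspector yet match nothing in a parser that reads the source as delivered. The snippet below uses the xml2 package directly rather than Rcrawler, and the HTML string is a made-up miniature for illustration, not the real profile markup, so it only demonstrates the mechanism and does not confirm that this is what happens on the archived page.

```r
library(xml2)

# Hypothetical stand-in for one profile table; the real page's markup may differ.
# Note that the source contains no <tbody> tags, as is common in hand-written HTML.
html <- paste0(
  '<html><body><table><tr><td><table>',
  '<tr><td><font>name: <a href="u1.html">user1</a><br>Boston, MA</font></td></tr>',
  '<tr><td><font>bands, zines</font></td></tr>',
  '</table></td></tr></table></body></html>'
)

doc <- read_html(html)

# Same shape as the failing expression: it requires a tbody step.
with_tbody <- xml_find_all(doc, "//td/table/tbody/tr[1]/td/font/text()[2]")

# Same expression with the tbody step removed.
without_tbody <- xml_find_all(doc, "//td/table/tr[1]/td/font/text()[2]")

# Compare how many nodes each expression matches, and what the second one returns.
print(length(with_tbody))
print(trimws(xml_text(without_tbody)))
```

If the archived page behaves like this miniature, dropping tbody from the two failing XpathPatterns (or replacing that step with //) would be the first thing to try in ContentScraper.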