
I have saved many web pages as text (.txt) files. These are public profile pages from a social media site, and I want a rough measure of how much content is on each profile page. If I save one of these text files with an .html extension and open it in a browser, I can see the rendered profile. But the raw text file is a poor indication of how developed the profile is: character counts on the raw file are essentially uncorrelated with how developed the viewable profile is. (I learned that html files as such are not good proxies for what shows up when you view them, since they contain a lot of text that never gets rendered in the browser window.)

The typical parsing functions in R for extracting text from .html files seem to drop a lot of the content - I think these profile pages are not very well structured.

I can open these files in an application like Chrome from R. But is there a way, programmatically from R, to copy the text rendered in Chrome and paste it into another file, as a way of measuring the text that actually appears in these profiles? I would like to automate this from R and loop over the files.

I'll place a dropbox link to example files (input and output) here -> https://www.dropbox.com/sh/4fqxwbj74tnfaxq/AACtexD7OVYYrMoTDrudbacba?dl=0. The file "test2_simple_pagecode.txt" contains the page source code of a sample profile. One could change its extension to .html and view the page in a browser. What I want to do is bring that file up in a browser window, then copy and paste the text of the entire page into a separate file, like the example in "test2_simple_cutpaste.txt". That way, the new file contains only words that are actually visible in the profile.

exl
  • Sounds to me like you may want to open the page in a headless browser. Not sure how you would do that for a local file. – Roman Luštrik May 24 '22 at 23:59
  • Thanks, @RomanLuštrik, I didn't know about headless browsers. I did find this on opening a local html file with puppeteer (https://stackoverflow.com/questions/47587352/opening-local-html-file-using-puppeteer). Wondering if this might spark a productive exploration of how to do this. I'm new to it, so I'll need to start studying it... – exl May 25 '22 at 01:04
  • It is possible. It would be simpler to explain if you can provide a sample – Dave2e May 25 '22 at 01:31
  • @Dave2e, that is encouraging to hear, and I have amended the question to include input file, and the end results I'm looking for. – exl May 25 '22 at 14:53
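Following up on the headless-browser suggestion in the comments, one possible sketch (not from the thread itself) uses the chromote package, which drives headless Chrome from R. The helper name and the two-second wait are illustrative assumptions; the idea is to load the local HTML file, let Chrome render it, and read document.body.innerText, which contains only the visible text.

```r
library(chromote)  # headless Chrome driver for R

# Hypothetical helper: return the visible text of a local HTML file
# after Chrome has rendered it (including javascript-generated content)
rendered_text <- function(path) {
  b <- ChromoteSession$new()
  on.exit(b$close())
  b$Page$navigate(paste0("file://", normalizePath(path)))
  Sys.sleep(2)  # crude wait for javascript to finish rendering
  res <- b$Runtime$evaluate("document.body.innerText")
  res$result$value
}

# Loop over saved profiles and count the characters that actually render
files <- list.files(pattern = "\\.html$")
counts <- vapply(files, function(f) nchar(rendered_text(f)), numeric(1))
```

A fixed sleep is the simplest way to wait for rendering; for heavy pages it may need to be longer, or replaced by polling until the text length stops growing.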

1 Answer


This page relies heavily on javascript to render its content. I suggest looking into RSelenium to process the page. RSelenium can execute the javascript, and you can then use the "rvest" package to extract the information of interest.

Here is a very quick and very dirty way to extract the information stored in the person’s profile, though there is also a lot of extraneous information stored alongside it.

It appears that the profile information is stored as JSON data inside an HTML comment in the page source. The example below extracts that comment, replaces the unicode escape sequences, and parses the JSON data.

library(stringr)

# Read the saved page source and collapse it into one long string
lines <- readLines("test2_simple_pagecode.txt")
alllines <- paste(lines, collapse = " ")

# Extract the HTML comment that holds the profile JSON
output <- stringr::str_extract(alllines, "<!--\\{\"content\"\\:\\{\"Notes\".+?-->")
nchar(output)

# Replace the escaped unicode hyphen (\u002d) with a space, then strip
# the "<!--" prefix and "-->" suffix before parsing the JSON
output2 <- gsub("\\\\u002d", " ", output)
jsonlite::parse_json(substr(output2, 5, nchar(output2) - 3))
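To make the pattern concrete without the Dropbox file, here is a toy version of the same extract-comment-then-parse step on an inline string. The JSON content here is made up for illustration; only the regex anchor mirrors the answer's code.

```r
library(stringr)
library(jsonlite)

# Made-up page source with profile JSON hidden in an HTML comment
page <- '<html><body><p>hi</p><!--{"content":{"Notes":"about me","Friends":2}}--></body></html>'

# Grab the comment, strip the "<!--" / "-->" delimiters, and parse it
comment <- str_extract(page, "<!--\\{\"content\"\\:\\{\"Notes\".+?-->")
profile <- parse_json(substr(comment, 5, nchar(comment) - 3))
profile$content$Notes    # "about me"
```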
Dave2e
  • I think that this is a text parsing approach, but the tough thing about it is that the string to search and anchor on can change from profile to profile. I think this is related to the heavy reliance on javascript to render the pages. This is why I think text parsing could be hard going, and I'd like to just use a tool that would bring up the page as an html file in a browser, do some kind of character count as displayed in the browser, then return the value. Can RSelenium do this when the target is a local file? – exl May 26 '22 at 14:24
  • I was able to learn a lot from this example and fashion a pretty decent solution. Thank you. I think there is much more I need to learn about Selenium and general html parsing, especially in the case of heavy javascript pages. Thanks for this invaluable guidance! – exl May 26 '22 at 16:13
  • The data is stored as JSON in an html comment; I expect the anchor to stay relatively constant, or you could pull all of the comments and test them. Yes, RSelenium should be able to process the pages offline. Depending on your system, it may just be a hassle getting it up and running. – Dave2e May 26 '22 at 22:38
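The "pull all of the comments and test them" idea from the last comment could be sketched like this; the sample string and helper function are illustrative, not from the thread:

```r
library(stringr)
library(jsonlite)

# Made-up page source with one plain comment and one JSON comment
page <- paste0('<!-- plain note --><div>x</div>',
               '<!--{"content":{"Notes":"bio text"}}-->')

# Pull every HTML comment from the page source
comments <- str_extract_all(page, "<!--.*?-->")[[1]]

# Keep only the comments whose body parses as valid JSON
parse_if_json <- function(x) {
  body <- substr(x, 5, nchar(x) - 3)
  tryCatch(parse_json(body), error = function(e) NULL)
}
parsed <- Filter(Negate(is.null), lapply(comments, parse_if_json))
length(parsed)  # only the JSON comment survives
```

This avoids anchoring on any profile-specific string: whatever comment happens to hold the JSON payload is found by trying to parse each one.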