I have saved lots of web pages as text (.txt files). These are public profile pages from a social media site. I want a rough measure of how much content is on these profile pages. If I rename one of these text files to .html and open it in a browser, I can see the profile rendered. But the raw text file is a poor indication of how developed the profile content is: character counts on the file are essentially uncorrelated with how developed the viewable profile is. (So I learned that HTML source files as such are not good proxies for what shows up when you view them, since they contain a lot of text that never gets rendered in the browser window.)
The typical R parsing functions for extracting text from .html files seem to drop a lot of the content - I think these profile pages are not very well structured.
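For reference, the kind of static parsing I mean is along these lines (rvest/xml2 are just the packages I have been poking at; the file name is a placeholder). Even with script/style nodes stripped out, this still misses anything the page builds with JavaScript at view time:

```r
library(rvest)
library(xml2)

page <- read_html("profile_pagecode.txt")   # saved page source (placeholder name)

# Drop nodes whose text never shows up in the browser window
xml_remove(xml_find_all(page, "//script | //style | //noscript"))

body <- html_element(page, "body")
visible_guess <- html_text2(body)   # collapses whitespace roughly like rendered text
nchar(visible_guess)
```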
I can open these files in an application like Chrome from R. But is there a way, programmatically from R, to cut/paste the text rendered in Chrome into another file, as a way of measuring the text that actually appears on these profiles? I would like to automate this from R and loop over the files.
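In other words, something like this is what I'm imagining, if a headless Chrome can be driven from R. I have not gotten this working - the chromote package and the calls below are my guess at how it might look, not tested code:

```r
library(chromote)

b <- ChromoteSession$new()   # starts a headless Chrome session

# Navigate to a local copy of the saved page (placeholder file name;
# on Windows the file:// prefix may need an extra slash)
b$Page$navigate(paste0("file://", normalizePath("profile.html", winslash = "/")))
Sys.sleep(2)                 # crude wait for the page to finish rendering

# Ask the browser for the text it actually displays
rendered <- b$Runtime$evaluate("document.body.innerText")$result$value
nchar(rendered)              # rough measure of visible content

b$close()
```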
I'll place a Dropbox link to example files (input and output) here -> https://www.dropbox.com/sh/4fqxwbj74tnfaxq/AACtexD7OVYYrMoTDrudbacba?dl=0. The file "test2_simple_pagecode.txt" has the page source code of a sample profile. You can change the extension to .html and open it in a browser to view the page. What I want to do is bring that file up in a browser window, then cut and paste the text of the entire page into a separate file, like the example in "test2_simple_cutpaste.txt". That way, the new file contains only the words that are actually visible on the profile.
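The automated loop I have in mind would be roughly this, using the same untested chromote guess as above (the folder name "saved_pages" and the "_cutpaste.txt" suffix are just placeholders):

```r
library(chromote)

b <- ChromoteSession$new()

txt_files <- list.files("saved_pages", pattern = "\\.txt$", full.names = TRUE)
results <- data.frame(file = txt_files, chars = NA_integer_)

for (i in seq_along(txt_files)) {
  # Chrome needs an .html extension to render the markup, so copy the file first
  html_file <- sub("\\.txt$", ".html", txt_files[i])
  file.copy(txt_files[i], html_file, overwrite = TRUE)

  b$Page$navigate(paste0("file://", normalizePath(html_file, winslash = "/")))
  Sys.sleep(2)  # crude wait for rendering; adjust as needed

  # Text as the browser displays it, roughly what a manual cut/paste would give
  rendered <- b$Runtime$evaluate("document.body.innerText")$result$value

  writeLines(rendered, sub("\\.txt$", "_cutpaste.txt", txt_files[i]))
  results$chars[i] <- nchar(rendered)
}

b$close()
results
```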