For my Uni class we need to make a web scraper, and one of the traps is large text files containing four columns of just 1s and 0s running on for thousands of lines. We are expected to determine whether text files are "low value" or "low information", but I cannot for the life of me find good resources to guide me through this process.
- Just declare all of them as "low value" and if your instructor disagrees, tell them [Yeah ... well ... you know, that's just like, your opinion, man](https://m.youtube.com/watch?v=4LGX8TbvGew). – Kelly Bundy Feb 02 '22 at 20:10
- I guess it depends on what you define as 'information'. I'm not sure about a specific resource for this, but you could assume that the page is in a human-readable language. This would let you apply some heuristics as you're parsing - e.g. the average length of each word, the number of words you can find in a dictionary - maybe with [string similarity algorithms?](https://stackoverflow.com/questions/3576211/string-similarity-algorithms) This would let you terminate early if the scrape violates these heuristics. – Tytrox Feb 02 '22 at 20:44
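The heuristics suggested in the comment above can be sketched as a quick filter. This is only an illustrative sketch, not a standard library or API: the function names, the entropy threshold, and the acceptable word-length range below are all assumptions I picked for demonstration, and you would want to tune them against real pages.

```python
import math
from collections import Counter

def char_entropy(text: str) -> float:
    """Shannon entropy (bits per character) of the text's character distribution."""
    counts = Counter(text)
    total = len(text)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())

def looks_low_information(text: str,
                          entropy_threshold: float = 2.0,
                          avg_word_len_range: tuple = (2.0, 12.0)) -> bool:
    """Flag text whose character entropy or average word length falls
    outside what we'd loosely expect from human-readable language.
    Thresholds are illustrative guesses, not established constants."""
    words = text.split()
    if not words:
        return True  # empty input carries no information
    avg_len = sum(len(w) for w in words) / len(words)
    lo, hi = avg_word_len_range
    return char_entropy(text) < entropy_threshold or not (lo <= avg_len <= hi)

# A dump of 0/1 columns uses only a handful of characters (low entropy)
# and has one-character "words", so both heuristics flag it;
# ordinary English prose should pass both checks.
binary_dump = "\n".join("1 0 0 1" for _ in range(1000))
prose = "The quick brown fox jumps over the lazy dog near the river bank."
```

A dictionary-word ratio (how many tokens appear in a word list) would make a good third signal, but it needs an external word list, so the sketch sticks to measures computable from the raw text alone.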