
I have HTML files for different versions of the same website, and I need a way to measure and quantify the change between versions.

What is a good way to measure a change in HTML files? Is there an established way to do this?

What is a good way to do this at scale using R or Python?

I have tried counting the number of lines and the number of tags in each HTML file (a rough sketch of this is below). Although I expect this to give a basic idea of the magnitude of change, I wonder if there is a better way to do it.
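For context, here is roughly what my current attempt looks like, as a minimal sketch. I'm assuming the two versions are already read into strings; BeautifulSoup is used here only for tag counting, and difflib's SequenceMatcher is included as one candidate similarity measure, not a settled choice:

```python
import difflib
from bs4 import BeautifulSoup  # assuming bs4 is available for tag counting

def compare_versions(old_html: str, new_html: str) -> dict:
    """Rough change metrics between two versions of the same HTML page."""
    old_soup = BeautifulSoup(old_html, "html.parser")
    new_soup = BeautifulSoup(new_html, "html.parser")

    # Simple size-based metrics: difference in line count and total tag count.
    line_delta = new_html.count("\n") - old_html.count("\n")
    tag_delta = len(new_soup.find_all(True)) - len(old_soup.find_all(True))

    # One candidate similarity score (0.0 to 1.0) over the raw markup.
    similarity = difflib.SequenceMatcher(None, old_html, new_html).ratio()

    return {"line_delta": line_delta,
            "tag_delta": tag_delta,
            "similarity": similarity}
```

I can loop this over pairs of saved versions, but a single ratio over raw markup conflates markup changes with content changes, which is part of why I am asking whether there is an established approach.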

bcdeniz
  • Would this help? https://stackoverflow.com/questions/977491/comparing-two-txt-files-using-difflib-in-python – michjnich Jul 26 '19 at 11:59
  • Most HTML pages today load data asynchronously, so I don't think cataloging diffs is such a good idea. – LazyCoder Jul 26 '19 at 12:01
  • @SANTOSHKUMARDESAI Can you explain why? Is it because I won't be able to access that asynchronously loaded data and so won't be able to include it in the comparison? – bcdeniz Jul 26 '19 at 19:41
  • No and yes. You can use Selenium drivers to load and view async data, but the comparison metrics would vary from time to time. In fact, what you have taken on is the work of an entire set of 4-5 teams at Google, if I am right. So if you know a good system of metrics to use, you can implement this over a weekend; that means you need more domain knowledge about the website you are scraping. – LazyCoder Jul 26 '19 at 20:41

0 Answers