
I have a bunch of HTML files, each about 40k lines long, and I need to extract only the sentences from them, so I want to automate this process. The text is located inside blocks like this:

<div class="text">...</div>

How do I search for these blocks and extract the text inside them to another file?

Viacheslav Yankov

1 Answer


If the files are truly HTML files (e.g. they are the source of an actual webpage), your best bet is to use HtmlAgilityPack, which, despite its age, is still incredibly robust (https://html-agility-pack.net/).

Your code to load the file and get all divs with the class "text" would be:

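using HtmlAgilityPack;

var doc = new HtmlDocument();
doc.Load(filePath);
// SelectNodes returns null (not an empty collection) when nothing matches.
var nodes = doc.DocumentNode.SelectNodes("//div[@class='text']");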

SelectNodes simply takes an XPath string, so the query is easy to adjust, and the documentation is pretty good!
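To cover the "extract to another file" part of the question, here is a rough sketch that builds on the snippet above. The class name TextBlockExtractor, the method ExtractTextBlocks, and the default "input" and "output" folder names are made up for illustration. It assumes the HtmlAgilityPack NuGet package is installed and that each matching div carries exactly class="text".

```csharp
using System.IO;
using System.Linq;
using HtmlAgilityPack;

public static class TextBlockExtractor
{
    // Returns the trimmed text of every <div class="text"> in one HTML file.
    public static string[] ExtractTextBlocks(string htmlPath)
    {
        var doc = new HtmlDocument();
        doc.Load(htmlPath);

        // SelectNodes returns null (not an empty collection) when nothing matches.
        var nodes = doc.DocumentNode.SelectNodes("//div[@class='text']");
        if (nodes == null)
            return new string[0];

        return nodes
            // InnerText drops nested tags; DeEntitize turns &amp; etc. back into characters.
            .Select(n => HtmlEntity.DeEntitize(n.InnerText).Trim())
            .Where(t => t.Length > 0)
            .ToArray();
    }

    public static void Main(string[] args)
    {
        // Hypothetical folder names; pass real paths as command-line arguments.
        var inputDir = args.Length > 0 ? args[0] : "input";
        var outputDir = args.Length > 1 ? args[1] : "output";
        Directory.CreateDirectory(outputDir);

        foreach (var htmlPath in Directory.EnumerateFiles(inputDir, "*.html"))
        {
            // Write one .txt file per HTML file, one extracted block per line.
            var outPath = Path.Combine(
                outputDir, Path.GetFileNameWithoutExtension(htmlPath) + ".txt");
            File.WriteAllLines(outPath, ExtractTextBlocks(htmlPath));
        }
    }
}
```

Note that @class='text' only matches when the attribute is exactly "text". If the divs may carry additional classes, the XPath could be changed to //div[contains(concat(' ', normalize-space(@class), ' '), ' text ')]. Splitting the extracted blocks into individual sentences is left out here, since it depends on the content.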

MindingData