HTML, while it resembles XML, is not actual XML.
This answer lists counterexamples showing why valid HTML can be invalid XML. A shortened summary:
- Some closing tags can be omitted.
- `<script>` escape magic
- Attributes without values (boolean attributes)
- Attributes without quotes
- Implicit open elements and multiple top level elements.
If any of these things are found in your HTML file, the file can still be valid HTML but it is not well-formed XML, which means that you cannot parse it as if it were XML.
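To make that concrete, here is a minimal sketch using only the Python standard library. The snippet, the helper names (`parses_as_xml`, `html_start_tags`) and the `TagCollector` class are made up for illustration; they are not from the question. The snippet combines several items from the list above (an unclosed `<p>`, a `<br>` with no closing tag, an unquoted attribute and a boolean attribute), so an XML parser should reject it while an HTML parser reads it without complaint.

```python
import xml.etree.ElementTree as ET
from html.parser import HTMLParser

# Hypothetical sample: fine as HTML, not well-formed as XML.
HTML_SNIPPET = "<p>First line<br>Second line<input type=checkbox checked>"


def parses_as_xml(text: str) -> bool:
    """Return True if the text is well-formed XML, False otherwise."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False


class TagCollector(HTMLParser):
    """Collects the names of all start tags seen by the HTML parser."""

    def __init__(self):
        super().__init__()
        self.tags = []

    def handle_starttag(self, tag, attrs):
        self.tags.append(tag)


def html_start_tags(text: str) -> list:
    """Parse the text as HTML and return the start tags in document order."""
    collector = TagCollector()
    collector.feed(text)
    collector.close()
    return collector.tags


if __name__ == "__main__":
    print(parses_as_xml(HTML_SNIPPET))    # the XML parser rejects it
    print(html_start_tags(HTML_SNIPPET))  # the HTML parser copes with it
```

The same input gives two different outcomes depending on which parser you hand it to, which is why an HTML-aware parser is the safer choice for HTML files.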
(e.g. *.xml, *.html, *.txt or whatever the extension of the file may be, they all are node based)
You're correct when you say that the file extension has no bearing on something being considered correct XML. Only the contents of the file are relevant.
A file extension is relatively meaningless, at least from a technical perspective. The only functional value of a file extension is that it allows Windows to identify what application it should use (by default) when you try to open the file.
As far as your code is concerned, the file extension (or lack thereof) is irrelevant (other than needing it as part of the filepath, to find the file on disk, of course).
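As a rough illustration of that point, the sketch below writes the same XML text to files with different extensions (and one with no extension at all) and parses each of them. The sample content, the `sample` file name and the `root_tags_by_extension` helper are assumptions made up for this example.

```python
import tempfile
import xml.etree.ElementTree as ET
from pathlib import Path

# Hypothetical sample content; any well-formed XML would do.
XML_CONTENT = "<note><to>reader</to></note>"


def root_tags_by_extension(extensions):
    """Write the same XML under each extension and return the parsed root tag."""
    results = {}
    with tempfile.TemporaryDirectory() as tmp:
        for ext in extensions:
            path = Path(tmp) / f"sample{ext}"
            path.write_text(XML_CONTENT, encoding="utf-8")
            # The parser only looks at the bytes in the file, not at its name.
            results[ext] = ET.parse(str(path)).getroot().tag
    return results


if __name__ == "__main__":
    print(root_tags_by_extension([".xml", ".html", ".txt", ""]))
```

Every file parses to the same root element, because only the contents decide whether the parse succeeds.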
If not, what is the best alternative for it. Below are few things that I want to do with my files.
Suggesting resources is out of scope for Stack Overflow. Google is your friend; look for libraries that help you parse HTML.