I have been stuck on this seemingly simple task for hours. None of the libraries I found seem to help, and no existing questions here seem to cover this scenario.
The requirements are fairly simple:
- I have an entire page's markup as a string.
- I need to use CSS selectors to point to the elements I need to scrape the data from.
- I DO NOT want to create actual HTML DOM elements, only scrape the data from the markup. The page might contain image, audio, video and other elements that I don't want to create.
- It needs to be able to deal with markup errors and HTML5-style tagging. Currently, trying to parse it as XML throws an "Invalid XML" exception.
- It needs to happen in the browser, so no Node.js modules.
In Java I've been able to do exactly this using jsoup, but there doesn't seem to be an equivalent library for JavaScript running in a browser.
Thanks for your time.