
I am writing an HTML to Markdown converter in Rust, using Kuchiki to get access to the parsed tree from html5ever.

For unknown HTML tags, I want to offer the option of ignoring them and passing them through to the output string, while still processing their children as normal. For that, I need the textual representation of the tag without its contents, but I can't figure out how best to get it.

The best I can come up with is:

  1. Clone the node
  2. Drop its children
  3. Call `node.to_string()`
  4. "parse" the string with a regular expression to separate the opening and closing tags.
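
Step 4 does not strictly need a regular expression. Below is a minimal sketch, assuming the childless clone serializes to a single element with double-quoted attribute values. `split_tags` is a hypothetical helper, not part of Kuchiki or html5ever, and it works on the serialized string only:

```rust
/// Splits the serialization of a childless element, such as
/// `<x-foo a="b"></x-foo>`, into its opening and closing tags.
/// Returns `None` if the input does not look like a single element.
/// For elements serialized without a closing tag (e.g. `<br>`),
/// the closing part is an empty string.
fn split_tags(serialized: &str) -> Option<(&str, &str)> {
    let s = serialized.trim();
    if s.len() < 3 || !s.starts_with('<') || !s.ends_with('>') || s.starts_with("</") {
        return None;
    }
    // With the children removed, a closing tag (if any) is the last `</...>`.
    if let Some(idx) = s.rfind("</") {
        // Text between `</` and the final `>`; it must look like a tag name,
        // otherwise the `</` was found inside an attribute value.
        let name = &s[idx + 2..s.len() - 1];
        let looks_like_name = !name.is_empty()
            && name
                .chars()
                .all(|c| c.is_ascii_alphanumeric() || matches!(c, '-' | '_' | ':'));
        if looks_like_name {
            return Some((&s[..idx], &s[idx..]));
        }
    }
    // No closing tag: treat the whole string as the opening tag.
    Some((s, ""))
}
```

For example, `split_tags("<x-foo a=\"b\"></x-foo>")` yields the pair `<x-foo a="b">` and `</x-foo>`. The result is still a re-serialization of the parsed node, not the original input text.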

I feel there must be a better way. I don't think Kuchiki provides this functionality out of the box, but I also don't know how to reach the html5ever API through Kuchiki, and I can't tell from the html5ever API documentation whether it provides anything like this.

  • I have never used kuchiki, but from looking at the documentation, maybe the following would work: since we are talking about HTML tags, you should be able to obtain the underlying element by calling `as_element` on the `Node`. That way you can obtain an `ElementData`, which has the fields `name` and `attributes`. I guess the `name` should be equal to the tag, so you could combine that with the attributes to reconstruct the HTML. This is just a guess, however. – aochagavia Feb 11 '17 at 15:33
  • Thanks for answering. I know I can reconstruct an HTML tag, but ideally I would get that part of the input stream exactly, since it might differ from a reconstruction, as if the software hadn't treated this part of the text as something that needed to be parsed. I realize this could only work if html5ever somehow keeps track of which text positions were responsible for each node in the parsed tree, and I haven't seen any evidence of that. –  Feb 11 '17 at 19:20
  • Just an idea I didn't test: if nothing made a copy of the input string during parsing and construction of the nodes into the form you own, then the `&str` of the node name would be pointing into the original input. If that were true, you would have the text positions you spoke about. – Jan Zerebecki Feb 26 '17 at 01:18
  • Could you give an example of what input you want vs. what output you expect? You can't get the open/close tags exactly, because by the time Kuchiki gets the tags, those values have been abstracted away. That is, HTML doesn't care whether you have open/close or auto-close tags. – Daniel Fath Dec 24 '19 at 16:42
  • Could this https://users.rust-lang.org/t/get-tag-name-with-kuchiki-html5ever help you? – rofrol Jan 07 '20 at 11:12
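
The reconstruction suggested in the first comment can be sketched with plain strings. This assumes the tag name and attribute pairs have already been read from the element (for instance from the `name` and `attributes` fields mentioned above); `escape_attr`, `opening_tag`, and `closing_tag` are hypothetical helpers, not Kuchiki or html5ever functions, and the output is a reconstruction that may differ from the original input text:

```rust
/// Escapes an attribute value for use inside double quotes.
fn escape_attr(value: &str) -> String {
    // `&` must be replaced first so the entities added below are not re-escaped.
    value.replace('&', "&amp;").replace('"', "&quot;")
}

/// Builds an opening tag from a tag name and (name, value) attribute pairs.
fn opening_tag(name: &str, attrs: &[(&str, &str)]) -> String {
    let mut out = format!("<{}", name);
    for (key, value) in attrs {
        out.push_str(&format!(" {}=\"{}\"", key, escape_attr(value)));
    }
    out.push('>');
    out
}

/// Builds the matching closing tag. Callers should skip this for
/// void elements such as `br`, which have no closing tag.
fn closing_tag(name: &str) -> String {
    format!("</{}>", name)
}
```

A converter could emit `opening_tag(...)`, then the converted children, then `closing_tag(...)`.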

0 Answers