Parsing html text to obtain input fields

Question

I currently have a big blob of HTML text, and I want to generate an input form based on the placeholders it contains. For example, if the text contains '[%Name%]', I want to read that in, recognize that 'Name' is present, and in turn enable a form field for name. There will be multiple placeholders ([%age%], [%height%], etc.).

I was thinking about using regex, but after doing some research it seems that regex is a horrible idea for parsing HTML. I came across parsing HTML pages with Groovy, but it is not strictly applicable to my implementation. I am storing the HTML-formatted text (which I am creating using CKEditor) in a database.

Is there an efficient way to do this in Java/Groovy? Or should I just create an algorithm similar to the examples shown here? I'm not too sure how effective the given algorithms would be, as they seem to be constructed around relatively small strings, whereas my string to parse may end up being quite large (a 15-20 page document).

Thanks in advance

Writing an HTML parser yourself is just asking for trouble. Use a library, as you mentioned above, or consider augmenting your implementation. — Kon, Jun 11 '14 at 15:46
If you have a proper document (i.e., XHTML) you can use XPath and/or XSLT. Otherwise constructing the DOM might be best. However, if you create the text yourself, it might be easier to get the relevant information at that stage. — Fabian, Jun 11 '14 at 15:49

score 1 · Accepted Answer · answered Jun 11 '14 at 18:41


Instead of reinventing the wheel, I think it's better to use jsoup. It is an excellent tool for your task, and its selector syntax makes it easy to obtain anything in an HTML page. Check out the usage examples in their cookbook.
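Since the '[%Name%]' markers are plain-text tokens rather than HTML tags, finding them does not really require parsing the HTML structure at all. Here is a minimal sketch using only `java.util.regex`, under a few assumptions that are mine and not from the question: field names consist of letters, digits, and underscores, and the editor never splits a marker with inline tags (e.g. `[%<b>Name</b>%]`). The class name `PlaceholderScanner` and the method `findFieldNames` are made up for illustration. If markers can be split by markup, extracting the text first (for instance with jsoup's `Jsoup.parse(html).text()`) and scanning that is probably the safer route.

```java
import java.util.LinkedHashSet;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: collects the distinct field names used in [%...%] markers.
class PlaceholderScanner {

    // Matches tokens such as [%Name%]; group 1 captures the field name.
    // Assumption: names are identifier-like (letters, digits, underscores).
    private static final Pattern PLACEHOLDER =
            Pattern.compile("\\[%\\s*([A-Za-z_][A-Za-z0-9_]*)\\s*%\\]");

    // Returns the field names in order of first appearance, without duplicates.
    static Set<String> findFieldNames(String html) {
        Set<String> names = new LinkedHashSet<>();
        Matcher matcher = PLACEHOLDER.matcher(html);
        while (matcher.find()) {
            names.add(matcher.group(1));
        }
        return names;
    }

    public static void main(String[] args) {
        String html = "<p>Dear [%Name%],</p>"
                + "<p>Age: [%age%], height: [%height%]. Bye, [%Name%].</p>";
        // prints [Name, age, height]
        System.out.println(findFieldNames(html));
    }
}
```

The scan is a single pass over the string, so a 15-20 page document should not be a problem. Each name in the returned set can then be mapped to a form field.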


Artem Zhirkov

