
I am trying to write the following regular expressions for a program of mine that parses a given HTML file. Could you help me with any of them?

<div>
<div class="menuItem">
<span>
class="emph"
Any string beginning with < and ending with >, i.e. all tags. 
The contents of the body tag.
The contents of all divs 
All divs that make menus

I have managed to figure out that the single div tag is simply "<div>" and that the "all tags" expression is <(\"[^\"]*\"|'[^']*'|[^'\">])*>
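For reference, here is a minimal Python sketch of the "all tags" expression above together with the four literal patterns from the list. The variable names, the sample string, and the tolerance for optional whitespace (`\s*`) are assumptions made for illustration, and the patterns assume straight ASCII quotes in the HTML.

```python
import re

# The "all tags" pattern from the question: '<', then any mix of
# double-quoted strings, single-quoted strings, or characters that are
# neither quotes nor '>', then the closing '>'.
ALL_TAGS = re.compile(r"""<("[^"]*"|'[^']*'|[^'">])*>""")

# Literal patterns from the list (whitespace tolerance is an assumption).
DIV_OPEN = re.compile(r"<div\s*>")
SPAN_OPEN = re.compile(r"<span\s*>")
MENU_DIV = re.compile(r"""<div\s+class\s*=\s*["']menuItem["']\s*>""")
EMPH_ATTR = re.compile(r"""class\s*=\s*["']emph["']""")

# Hypothetical sample input, not taken from the assignment.
sample = '<div class="menuItem"><span class="emph">Home</span></div><div>a > b</div>'

# Collect the full text of every tag in the sample.
tags = [m.group(0) for m in ALL_TAGS.finditer(sample)]
print(tags)
```

The quoted alternatives are what let the pattern survive a `>` inside an attribute value, such as `<a title="x > y">`; a plain `<[^>]*>` would stop too early there.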

Do you think you could help me with any of the rest? Thank you in advance guys...

I know that HTML parsing is an already solved problem and that regex is not efficient for it. However, the assignment requires that I do it this way, in order to demonstrate how regular expressions work by making them (sometimes) long and detailed. That's why I'm handling the HTML file as a plain text file and need to apply these regular expressions to it.
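One possible approach to the three remaining items (body contents, div contents, menu divs), treating the file as plain text, is a non-greedy capture between the opening and closing tag. This is only a sketch: the names and sample HTML are made up, and it assumes that "divs that make menus" means divs with `class="menuItem"`. It also breaks on nested divs, which is exactly the limitation the comments warn about.

```python
import re

FLAGS = re.DOTALL | re.IGNORECASE  # let '.' cross newlines, ignore tag case

# Contents of the body tag: everything between <body ...> and </body>.
BODY = re.compile(r"<body\b[^>]*>(.*?)</body\s*>", FLAGS)

# Contents of each div: non-greedy, so it stops at the FIRST </div>.
DIV = re.compile(r"<div\b[^>]*>(.*?)</div\s*>", FLAGS)

# Divs whose class attribute is "menuItem" (assumed meaning of "menu divs").
MENU = re.compile(
    r"""<div\b[^>]*\bclass\s*=\s*["']menuItem["'][^>]*>(.*?)</div\s*>""", FLAGS
)

# Hypothetical sample input.
html = """<html><body>
<div class="menuItem">Home</div>
<div>Plain text</div>
</body></html>"""

body = BODY.search(html).group(1)
divs = DIV.findall(html)
menus = MENU.findall(html)

# Nested divs defeat the pattern: the match ends at the inner </div>.
nested = DIV.findall("<div>a<div>b</div>c</div>")
print(divs, menus, nested)
```

For flat markup like the sample this is enough; for nested divs a regular expression cannot track the nesting depth, so the result is cut short.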

Oleks
Alex Encore
  • [Don't use regex to parse HTML. You'll drive yourself insane.](http://www.codinghorror.com/blog/2009/11/parsing-html-the-cthulhu-way.html) – Michael Fredrickson Mar 01 '12 at 13:16
  • I know but this is part of an assignment and it is requested that I do it like this... :/ – Alex Encore Mar 01 '12 at 13:18
  • @AlexEncoreTr: It's probably a trick question. If you try to answer it without suggesting an alternative to regular expressions, you automatically fail the course. – Mark Byers Mar 01 '12 at 13:18
  • Then your teacher is H.P. Lovecraft... – Michael Fredrickson Mar 01 '12 at 13:18
  • `"in order to demonstrate how regular expressions can work by making them (sometimes) long and detailed"` and impossible to read and unmaintainable and the cause of actual nightmares – Gareth Mar 01 '12 at 13:24
  • I know this might give the impression that I'm completely clueless about what the alternatives to this problem are (and that's very likely to be the case!), but the exercise clearly states at some point that I "should display the line number of all of the occurrences of each of the strings below. Also use start and end methods to display index values." – Alex Encore Mar 01 '12 at 13:25
  • I assume that this is something that regex can definitely do, and since there hasn't been any implication that a different approach should be taken, I see no harm in giving it a chance with regex! It should be fun anyway, and I also know that I won't fail; it's not worth much ;) – Alex Encore Mar 01 '12 at 13:26
  • possible duplicate of [RegEx match open tags except XHTML self-contained tags](http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags) – KARASZI István Mar 01 '12 at 13:29
  • Anyway, if it's not the duplicate of parsing (X)HTML with RegEx, it should be closed as not constructive. Sorry Alex, but regular expressions are not for this. – KARASZI István Mar 01 '12 at 13:32
  • Just point out to the teacher that this illustrates how HTML is not a regular language. – Phil H Mar 01 '12 at 13:32
  • Oh people I appreciate your answers but I tell you, those are things that I have researched and I know about the already existing libraries and I know that html is not a regular language and all this... but then again I can't fail this stupid assignment just because the lecturer is retarded. I just ask for someone to have the time and mood to help me with at least one more expression! :P – Alex Encore Mar 01 '12 at 13:35
  • @Alex Then fail the course! Instead of producing a regex-based HTML parser, write 5000 words on why regex shouldn't be used to parse HTML and submit that instead. And there's plenty of good material you can cite. – El Ronnoco Mar 01 '12 at 15:09
  • I would prefer to avoid such an immature and cocky approach. Thanks. While you are thinking of ways to boycott this project, I managed to end up with almost all the regular expressions I required. The only ones left, are the three last. I'm working on it though. – Alex Encore Mar 01 '12 at 20:00
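The requirement quoted in the comments (report the line number of each occurrence and use the start and end methods for index values) can be sketched as follows. The exercise's language is not stated; this sketch uses Python, whose match objects expose `start()` and `end()`. The helper name `find_with_positions` and the sample text are hypothetical.

```python
import re

# The "all tags" pattern from the question.
ALL_TAGS = re.compile(r"""<("[^"]*"|'[^']*'|[^'">])*>""")

def find_with_positions(pattern, text):
    """Return (line_number, start, end, matched_text) for every match."""
    results = []
    for m in pattern.finditer(text):
        # Line number = newlines before the match start, plus one (1-based).
        line_no = text.count("\n", 0, m.start()) + 1
        results.append((line_no, m.start(), m.end(), m.group(0)))
    return results

# Hypothetical sample input.
text = "<html>\n<body>\n<span>hi</span>\n</body>\n</html>"
hits = find_with_positions(ALL_TAGS, text)
for line_no, start, end, tag in hits:
    print(f"line {line_no}: {tag} at [{start}, {end})")
```

The indices are offsets into the whole file rather than into the line; subtracting the offset of the line's first character would give per-line columns if those are wanted instead.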

1 Answer


Please, for your own sanity, consider using an HTML parser library for the language you are using. Regexps are not a suitable tool for this application - they cannot reliably or cleanly handle structured data like HTML.

https://stackoverflow.com/a/1732454/457201
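As a rough illustration of the parser-library route, here is a sketch using Python's standard `html.parser` module; the question does not say which language the assignment uses, so treat the choice as an assumption. The class name `DivTextCollector` and the sample input are hypothetical. Because the parser tracks nesting depth, it handles nested divs, which a regular expression cannot.

```python
from html.parser import HTMLParser

class DivTextCollector(HTMLParser):
    """Collect the text inside each outermost <div>, nested divs included."""

    def __init__(self):
        super().__init__()
        self.depth = 0    # how many divs are currently open
        self.texts = []   # one string per outermost div

    def handle_starttag(self, tag, attrs):
        if tag == "div":
            if self.depth == 0:
                self.texts.append("")  # start a new outermost div
            self.depth += 1

    def handle_endtag(self, tag):
        if tag == "div" and self.depth > 0:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth > 0:
            self.texts[-1] += data  # text belongs to the current outer div

collector = DivTextCollector()
collector.feed("<div>a<div>b</div>c</div><div>d</div>")
collector.close()
print(collector.texts)
```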

Community
D_Bye
  • Ah, just seen your reasons for wanting to use regexps. But once your class has finished, the advice stands. Er, good luck... – D_Bye Mar 01 '12 at 13:27
  • Oooooh I DO know that I have to be lazy and use an already existing library. That's what a programmer would do! But unfortunately, this is an introductory assignment on regular expressions and I must do it this way. I hate our lecturer for this, because I know that the next laboratory will introduce the regular expression libraries, but for now, I have to do it the hard and stupid way. – Alex Encore Mar 01 '12 at 13:28