2

I want to run a regex over a source file named source.html or source.txt that contains:

<OPTION value=5>&nbsp;&nbsp;5 - Course Alpha (3)</OPTION> <OPTION value=6>&nbsp;&nbsp;6 - Course Beta (3)</OPTION>

to get:

5 - Course Alpha (3)
6 - Course Beta (3)

That is, I need to find the pattern:

<OPTION v

then find the first number after it, and capture everything from that number until I see:

</OPTION>

How can I implement this in Perl with a regex?

PS: It should read the content from a file and write the output to a file.

Toon Krijthe
kamaci

3 Answers

4

You do not want to use a regex, you want to use an HTML parser. Here's a good article on the subject which explains why regexes are fragile and how to use HTML::TreeBuilder.

There's also a small pile of similar questions and answers about extracting data from HTML documents.

Community
Schwern
  • Thanks for your answers and ideas. I am reading them. Just one more thing: I can use other libraries, but I want to implement this with a regex. – kamaci Apr 13 '11 at 14:16
  • Schwern, while I support your position about how challenging it can be to use regexes on *general, open-ended* HTML, we do not know enough about the original poster’s situation to know whether that position is fully justified and applicable in his case. As I know you know, using regexes on *discrete and well-accounted-for* HTML snippets is perfectly reasonable, and indeed at times even preferred. If they can control or otherwise delineate the limits of the problem-space to a small enough subset, a regex is a lot easier than a parsing approach, but if they cannot, then it is not. Agreed? – tchrist Apr 13 '11 at 14:30
  • @tchrist Yeah. I'm still not trusting him with a gun. – Schwern Apr 13 '11 at 23:25
1
perl -lwe '$_="<OPTION value=5>&nbsp;&nbsp;5 - Course Alpha (3)</OPTION> <OPTION value=6>&nbsp;&nbsp;6 - Course Beta (3)</OPTION>"; s/\&nbsp;//g; print $1 while /<OPTION [^>]*>([^<]+)/g'
R. Martinho Fernandes
CBO
  • You should format code by putting four spaces before each line. You can also select it and click the `{}` button. More helpful tips at the [Markdown Editing Help](http://stackoverflow.com/editing-help) page. – R. Martinho Fernandes Apr 13 '11 at 14:04
  • I tried perl -0777ne s/\ //g; "print $1 while / – kamaci Apr 13 '11 at 15:17
0

What about

/<OPTION v.*?>.*?(\d.+?)<\/OPTION>/

http://regexr.com?2thm8

There you will find your strings in the first capturing group.

stema
  • It would be nice to get a reason for down votes. Otherwise it's not possible for me to improve my answer, and I am also just a human. OK, I recognised an error and will change it. – stema Apr 13 '11 at 14:04
  • @stema thanks for your answer. How can I check it with the website that you linked here? – kamaci Apr 13 '11 at 14:09
  • You gave the right answer to the wrong question. The OP does not know that using regexes to parse HTML is a bad idea, so a straight answer is not helpful. – Schwern Apr 13 '11 at 14:11
  • @Schwern, I know that. But for simple cases, to get some values, a regex can be an option. I don't want to argue about whether that is the case here. I don't parse HTML, so I can't advise him on the use of those tools. But +1 for your answer. Always use the right tool! – stema Apr 13 '11 at 14:16
  • @kamaci The use of this page is quite straightforward. Enter your regex at the top and your test strings into the large text field. Matches are marked in blue, and if you move the mouse over a match it shows you the content of the capturing groups. On the right there is help on the different regex expressions. Be aware that regexes differ from language to language. I don't know what regex engine is behind that site, but I use it anyway for Perl; for most expressions it's no problem. – stema Apr 13 '11 at 14:21
  • @stema thanks for your answer and comments, but can you edit your code so that it reads the content from a file and writes the output to a file, so I can see the result better? – kamaci Apr 13 '11 at 14:36
  • When I run it with online regex tester web sites, they say no match found, for example http://www.regular-expressions.info/javascriptexample.html – kamaci Apr 13 '11 at 15:10
  • @kamaci There is excellent Perl documentation on the net. For Perl regexes, see [perlretut](http://perldoc.perl.org/perlretut.html) and [perlrequick](http://perldoc.perl.org/perlrequick.html); for Perl itself, see [Perldoc](http://perldoc.perl.org/). At the moment I am at a computer with no Perl, so I am not able to provide code for that. Apart from this, I think something should come from you. Regular expressions are especially dangerous to handle if you don't understand what they are doing and are not able to fix even small issues on your own. – stema Apr 13 '11 at 19:20