I have HTML where I need to collect all the content that has a particular format, e.g. get everything that is in the 00.000.000/0000-00 or XX.YYY.IIO/KKKK-LL formats.

Would use of regular expressions be the best way to accomplish this, or how else can I accomplish this?

Nathan Tuggy
Dark Ducke
  • I could do what I wanted with this: `[0-9]{2}\.?[0-9]{3}\.?[0-9]{3}\/?[0-9]{4}\-?[0-9]{2}` – Dark Ducke Jul 09 '15 at 23:25
  • That won't match `XX.YYY.IIO/KKKK-LL`. [`.{2}\..{3}\..{3}\/.{4}-.{2}`](http://regexr.com/3bc4i) ? – TLama Jul 09 '15 at 23:36
  • And making the dot, slash and hyphen separators optional is wrong. Such a pattern would also match `00000000000000`. – TLama Jul 09 '15 at 23:52
  • Ugh! Yet another "how do I parse HTML/XML with regular expressions" question. Read http://stackoverflow.com/q/701166/62576 for one of many posts about why you've chosen the wrong tool for the job, and then use a DOM parser to make your life (and the lives of others who may have to maintain your code) much easier. – Ken White Jul 10 '15 at 02:38
  • http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454 he comes – David Heffernan Jul 10 '15 at 03:08
  • Hmmm... he is not really parsing the HTML itself, i.e. the tags etc. He is merely searching text that happens to be HTML for the pattern. But if I understand it right, the XX.YYY etc. are merely the mask he is looking for (similar to YYYY.MM.DD etc.). Then `\d{2} etc...` would probably be a solution. – Rudy Velthuis Jul 10 '15 at 08:29
  • @David: a classic. And quite a lot of work to get the formatting that way. – Rudy Velthuis Jul 10 '15 at 08:29
  • @Rudy, digit would not match the alpha chars from the latter example in the question. You can try it in my carrot example. – TLama Jul 10 '15 at 08:57
  • As I said, ISTM the alpha chars are only the mask he/she is trying to match. IOW, the first two digits form the `XX` part, the next three digits the `YYY` part, the `II0` stands for the `II` part which consists of two digits followed by `0`, etc. – Rudy Velthuis Jul 10 '15 at 10:01
  • @Rudy, if they are format strings, not value examples, then we don't know their definition, hence we could only guess here. – TLama Jul 10 '15 at 10:35
  • AFAICT, he wants to look for strings in that format. So something like `\d{2}\.\d{3}\.\d{2}0/\d{4}-\d{2}`. Should not be too hard to do. How the parts of the match are used (what the digits stand for) is not the problem here. – Rudy Velthuis Jul 10 '15 at 10:59
  • FWIW, the different parts (XX, YYY, etc.) could be captured separately, making the parsing a lot easier. – Rudy Velthuis Jul 10 '15 at 11:06
  • @DarkDucke Could you define what you mean by XX.YYY.IIO/KKKK-LL, to avoid misunderstanding? I thought you also wanted to accept letters – m.cekiera Jul 10 '15 at 11:23
  • @m.cekiera, I meant any characters within the mask "nn.nnn.nnn/nnnn-nn", where each n can be alphanumeric. I already have it working when it's just numbers! – Dark Ducke Jul 10 '15 at 18:57
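
The objection to optional separators raised in the comments can be checked directly. The sketch below uses Python's `re` module purely for illustration: the thread does not name a language, and both sample strings are made up. It compares the pattern from the first comment with a variant that requires the separators and captures each part in its own group, as suggested above.

```python
import re

# Pattern from the first comment: every separator is followed by "?", so it is optional.
optional_seps = re.compile(r"[0-9]{2}\.?[0-9]{3}\.?[0-9]{3}\/?[0-9]{4}\-?[0-9]{2}")

# Variant with mandatory separators and one capture group per part of the mask.
required_seps = re.compile(r"([0-9]{2})\.([0-9]{3})\.([0-9]{3})\/([0-9]{4})-([0-9]{2})")

bare_digits = "0" * 14              # hypothetical input: 14 digits, no separators
formatted = "12.345.678/0001-95"    # hypothetical input in the wanted format

# Does each pattern accept the separator-free string as a whole?
loose_hit = optional_seps.fullmatch(bare_digits) is not None
strict_hit = required_seps.fullmatch(bare_digits) is not None

# The capture groups return the individual parts of a well-formed value.
parts = required_seps.fullmatch(formatted).groups()

print(loose_hit, strict_hit)
print(parts)
```

With the separators required, a plain run of digits is rejected, and the groups give access to each part without further string splitting.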

2 Answers

If you're looking for a pattern that will match:

xx.xxx.xxx/xxxx-xx

where x is only an alphanumeric char (that is a-z, A-Z and 0-9), then you can use this pattern:

[a-zA-Z0-9]{2}\.[a-zA-Z0-9]{3}\.[a-zA-Z0-9]{3}\/[a-zA-Z0-9]{4}-[a-zA-Z0-9]{2}

You can try it in this example.
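
As a rough sketch of how this pattern could be applied to collect every occurrence from a page, here is a Python version. Python, the `MASK` name and the HTML fragment are assumptions made for illustration; the question does not say which language or input is involved.

```python
import re

# Alphanumeric pattern from this answer; all separators are required.
MASK = re.compile(
    r"[a-zA-Z0-9]{2}\.[a-zA-Z0-9]{3}\.[a-zA-Z0-9]{3}\/[a-zA-Z0-9]{4}-[a-zA-Z0-9]{2}"
)

# Hypothetical HTML fragment; real input would come from a file or an HTTP response.
html = """
<table>
  <tr><td>Acme</td><td>12.345.678/0001-95</td></tr>
  <tr><td>Other</td><td>AB.CDE.FG0/HIJK-LM</td></tr>
  <tr><td>Noise</td><td>12345678000195</td></tr>
</table>
"""

# findall returns every non-overlapping match as a string (the pattern has no groups).
matches = MASK.findall(html)
print(matches)
```

The pattern is not anchored, so it will also pick up a matching run embedded in a longer alphanumeric string. If that matters, word boundaries (`\b`) around the pattern are one option.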

TLama

Try with:

\w{2}\.\w{3}\.\w{3}\/\w{4}-\w{2}
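
One difference from an explicit `[a-zA-Z0-9]` class is worth keeping in mind: in most regex flavors `\w` also matches the underscore, and in some (Python 3 string patterns among them) it matches non-ASCII letters and digits too. The Python sketch below illustrates the underscore case; the language choice and the sample value are assumptions, not something taken from the question.

```python
import re

# Pattern from this answer, using the \w shorthand.
word_pattern = re.compile(r"\w{2}\.\w{3}\.\w{3}\/\w{4}-\w{2}")

# Explicit alphanumeric class for comparison.
alnum_pattern = re.compile(
    r"[a-zA-Z0-9]{2}\.[a-zA-Z0-9]{3}\.[a-zA-Z0-9]{3}\/[a-zA-Z0-9]{4}-[a-zA-Z0-9]{2}"
)

with_underscore = "1_.345.678/0001-95"  # hypothetical value containing "_"

# \w accepts the underscore, the explicit class does not.
word_hit = word_pattern.fullmatch(with_underscore) is not None
alnum_hit = alnum_pattern.fullmatch(with_underscore) is not None

print(word_hit, alnum_hit)
```

If underscores can never occur in the data, the two patterns behave the same on ASCII input and the shorter `\w` form is fine.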
m.cekiera