Extract strings using regular expressions in Delphi

Question

I have an HTML document from which I need to extract every string that follows a particular format, e.g. everything in the 00.000.000/0000-00 or XX.YYY.IIO/KKKK-LL formats.

Are regular expressions the best way to do this, or is there a better approach?

I could do what I wanted with this: [0-9]{2}\.?[0-9]{3}\.?[0-9]{3}\/?[0-9]{4}\-?[0-9]{2} — Dark Ducke, Jul 09 '15 at 23:25
That won't match `XX.YYY.IIO/KKKK-LL`. [`.{2}\..{3}\..{3}\/.{4}-.{2}`](http://regexr.com/3bc4i) ? — TLama, Jul 09 '15 at 23:36
And your use of optional for those dot, slash and hyphen separators is wrong. You would be matching also `00000000000000` with such pattern. — TLama, Jul 09 '15 at 23:52
Ugh! Yet another "how do I parse HTML/XML with regular expressions" question. Read http://stackoverflow.com/q/701166/62576 for one of many posts about why you've chosen the wrong tool for the job, and then use a DOM parser to make your life (and the lives of others who may have to maintain your code) much easier. — Ken White, Jul 10 '15 at 02:38
http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454 he comes — David Heffernan, Jul 10 '15 at 03:08
Hmmm... he is not really parsing the HTML itself, i.e. the tags etc. He is merely looking in a text that happens to be HTML for the pattern. But if I understand it right, the XX.YYY etc. are merely the mask he is looking for (similar to YYYY.MM.DD etc.). Then `\d{2}` etc. would probably be a solution. — Rudy Velthuis, Jul 10 '15 at 08:29
@David: a classic. And quite a lot of work to get the formatting that way. — Rudy Velthuis, Jul 10 '15 at 08:29
@Rudy, digit would not match the alpha chars from the latter example in the question. You can try it in my carrot example. — TLama, Jul 10 '15 at 08:57
As I said, ISTM the alpha chars are only the mask he/she is trying to match. IOW, the first two digits form the `XX` part, the next three digits the `YYY` part, the `II0` stands for the `II` part which consists of two digits followed by `0`, etc. — Rudy Velthuis, Jul 10 '15 at 10:01
@Rudy, if they are format strings, not value examples, then we don't know their definition, hence we could only guess here. — TLama, Jul 10 '15 at 10:35
AFAICT, he wants to look for strings in that format. So something like `\d{2}\.\d{3}\.\d{2}0/\d{4}-\d{2}`. Should not be too hard to do. How the parts of the match are used (what the digits stand for) is not the problem here. — Rudy Velthuis, Jul 10 '15 at 10:59
FWIW, the different parts (XX, YYY, etc.) could be captured separately, making the parsing a lot easier. — Rudy Velthuis, Jul 10 '15 at 11:06
@DarkDucke Could you define what you mean by XX.YYY.IIO/KKKK-LL, to avoid misunderstanding? I thought you also wanted to accept letters. — m.cekiera, Jul 10 '15 at 11:23
@m.cekiera, I meant any characters within the mask "nn.nnn.nnn/nnnn-nn", where each "n" can be alphanumeric. I already have it working when it's just numbers! — Dark Ducke, Jul 10 '15 at 18:57
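
The point raised in the comments about optional separators can be checked directly. The following is a minimal sketch, assuming Delphi XE or later with the `System.RegularExpressions` unit; the `^` and `$` anchors and the two sample inputs are additions for illustration and are not part of the original pattern.

```delphi
program OptionalSeparators;

{$APPTYPE CONSOLE}

uses
  System.RegularExpressions;

const
  // Pattern from the first comment: every separator is optional.
  Loose = '^[0-9]{2}\.?[0-9]{3}\.?[0-9]{3}\/?[0-9]{4}\-?[0-9]{2}$';
  // Same pattern with the separators made mandatory.
  Strict = '^[0-9]{2}\.[0-9]{3}\.[0-9]{3}\/[0-9]{4}-[0-9]{2}$';
  // Hypothetical inputs, chosen for illustration only.
  Bare = '00000000000000';
  Masked = '00.000.000/0000-00';

begin
  Writeln(TRegEx.IsMatch(Bare, Loose));     // TRUE: 14 bare digits slip through
  Writeln(TRegEx.IsMatch(Bare, Strict));    // FALSE
  Writeln(TRegEx.IsMatch(Masked, Loose));   // TRUE
  Writeln(TRegEx.IsMatch(Masked, Strict));  // TRUE
end.
```

With the separators optional, the pattern accepts the formatted value and the same digits with no punctuation at all, which is rarely what a mask-based search intends.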

score 3 · Accepted Answer · answered Jul 10 '15 at 20:49

If you're looking for a pattern that will match:

xx.xxx.xxx/xxxx-xx

where x is only an alphanumeric char (that is a-z, A-Z and 0-9), then you can use this pattern:

[a-zA-Z0-9]{2}\.[a-zA-Z0-9]{3}\.[a-zA-Z0-9]{3}\/[a-zA-Z0-9]{4}-[a-zA-Z0-9]{2}

You can try it in this example.
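
To collect every occurrence from a string in Delphi, one option is `TRegEx.Matches`. This is only a sketch, assuming Delphi XE or later (`System.RegularExpressions`); the `Html` constant is made-up sample input, and in a real program the text would come from wherever the page is loaded.

```delphi
program ExtractMasked;

{$APPTYPE CONSOLE}

uses
  System.RegularExpressions;

const
  // The pattern from this answer, unchanged.
  Pattern = '[a-zA-Z0-9]{2}\.[a-zA-Z0-9]{3}\.[a-zA-Z0-9]{3}\/[a-zA-Z0-9]{4}-[a-zA-Z0-9]{2}';
  // Hypothetical sample input; not taken from the question.
  Html = '<td>12.345.678/0001-90</td><td>AB.CDE.FG0/HIJK-LM</td><td>00000000000000</td>';

var
  Match: TMatch;

begin
  // TRegEx.Matches returns every non-overlapping match in the input.
  for Match in TRegEx.Matches(Html, Pattern) do
    Writeln(Match.Value);
  // Prints:
  //   12.345.678/0001-90
  //   AB.CDE.FG0/HIJK-LM
end.
```

The pattern is not anchored, so it also finds a masked value embedded in a longer run of alphanumeric characters. If that matters, adding `\b` at both ends is one way to require word boundaries.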

score 1 · Answer 2 · answered Jul 09 '15 at 23:54

Try with:

\w{2}\.\w{3}\.\w{3}\/\w{4}-\w{2}

m.cekiera
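
Note that `\w` is slightly broader than `[a-zA-Z0-9]`: it also matches at least the underscore. A minimal sketch of the difference, assuming Delphi XE or later (`System.RegularExpressions`); the anchors and the sample input are additions for illustration.

```delphi
program WordClassVsAlnum;

{$APPTYPE CONSOLE}

uses
  System.RegularExpressions;

const
  // Pattern from this answer, anchored to test the whole string.
  WordPattern = '^\w{2}\.\w{3}\.\w{3}\/\w{4}-\w{2}$';
  // Pattern from the accepted answer, anchored the same way.
  AlnumPattern = '^[a-zA-Z0-9]{2}\.[a-zA-Z0-9]{3}\.[a-zA-Z0-9]{3}\/[a-zA-Z0-9]{4}-[a-zA-Z0-9]{2}$';
  // Hypothetical input made only of underscores and separators.
  Underscores = '__.___.___/____-__';

begin
  Writeln(TRegEx.IsMatch(Underscores, WordPattern));   // TRUE: \w accepts '_'
  Writeln(TRegEx.IsMatch(Underscores, AlnumPattern));  // FALSE
end.
```

If underscores can never appear in the data, the two patterns behave the same for ASCII input and the shorter one is easier to read.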
