0

How can we capture this optional group? (I mean a group that consumes multiple lines.)

green group->optional group

red line->new segment(same patterns repeat)

my pattern:

(\t{2}<idx:entry name="dic">\r\n)(\t{4}<idx:orth>)(.+\r\n)(\t{4}<idx:infl>[^</idx:infl>]+)?


Any idea how to capture this optional group, which doesn't have a fixed length?

Alan Moore
wiki

2 Answers

1

Try this:

\s*<idx:entry name="dic">\s*<idx:orth>[^<]*\s*(<idx:infl>\s*.*\s*</idx:infl>)

Whitespace between tags is usually insignificant in XML, so you shouldn't have to specify the exact number of tabs and line breaks in your regex. Just use \s to match whitespace (this includes spaces, tabs, and line breaks).

Everything between the parentheses () is captured, and you can access this group using \1 or $1, depending on your regex engine.
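As a rough illustration of reading the captured group from code, here is a minimal Python sketch. The sample text is hypothetical (the real file was only shown in a screenshot), and `match.group(1)` is Python's equivalent of \1 or $1. Note that `.` does not match a newline by default in Python, so this pattern only copes with a single line of content inside the `<idx:infl>` element.

```python
import re

# Hypothetical sample text, made up for illustration.
SAMPLE = (
    '\t\t<idx:entry name="dic">\r\n'
    '\t\t\t\t<idx:orth>apple\r\n'
    '\t\t\t\t<idx:infl>\r\n'
    '\t\t\t\t\t<idx:iform value="apples"/>\r\n'
    '\t\t\t\t</idx:infl>\r\n'
)

# The pattern from this answer: \s* absorbs any tabs and line breaks between tags.
PATTERN = re.compile(
    r'\s*<idx:entry name="dic">\s*<idx:orth>[^<]*\s*(<idx:infl>\s*.*\s*</idx:infl>)'
)

match = PATTERN.search(SAMPLE)
# group(1) is the text matched by the first pair of parentheses.
infl_block = match.group(1) if match else None
print(repr(infl_block))
```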

However, when parsing XML it's generally a better idea to use a proper XML parser (for example, a DOM parser queried with XPath) instead of a regex.
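For comparison, a parser-based version might look like the sketch below. It uses Python's standard `xml.etree.ElementTree` and assumes a well-formed document in which the `idx` prefix is bound to a namespace. The root element, the namespace URI `urn:example:idx`, and the sample entries are all made up for illustration. With a parser, the optional element is simply present or absent, and its length does not matter.

```python
import xml.etree.ElementTree as ET

# Hypothetical namespace URI; a real file declares its own.
NS = {"idx": "urn:example:idx"}

# Hypothetical, well-formed sample document.
DOC = """<root xmlns:idx="urn:example:idx">
  <idx:entry name="dic">
    <idx:orth>apple</idx:orth>
    <idx:infl>
      <idx:iform value="apples"/>
    </idx:infl>
  </idx:entry>
  <idx:entry name="dic">
    <idx:orth>sheep</idx:orth>
  </idx:entry>
</root>"""

root = ET.fromstring(DOC)
entries = []
for entry in root.findall("idx:entry", NS):
    orth = entry.find("idx:orth", NS)
    infl = entry.find("idx:infl", NS)  # None when the optional element is missing
    forms = [] if infl is None else [f.get("value") for f in infl.findall("idx:iform", NS)]
    entries.append((orth.text.strip(), forms))

print(entries)
```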

Felix Glas
  • +1 on the DOM parser suggestion, you're heading for a ton of headaches using regex for this. Obligatory zalgo answer link: http://stackoverflow.com/a/1732454/8127 – Sundar R Jul 21 '13 at 19:16
  • I'm just exploring the power of regex; this is just a case study. Your pattern is not working. My question is: how do I consume multiple lines? – wiki Jul 21 '13 at 20:39
  • anyway I found the answer; [\s\S]* – wiki Jul 21 '13 at 20:58
  • @wiki What do you mean by "consume multiline"? Regexes are only used to match certain parts of text. The programmer then has to decide what to do with the matching text, like replacing it or checking that it exists. – Felix Glas Jul 21 '13 at 21:02
0

I found this helpful for consuming multilines:

[\s\S]*</idx:infl>
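To tie this back to the optional group in the question, here is a small Python sketch. The sample text and the exact pattern are my own assumptions, not the original file. It uses the lazy form `[\s\S]*?` inside an optional group, so each match stops at the first `</idx:infl>`. The greedy `[\s\S]*` would run on to the last `</idx:infl>` in the input.

```python
import re

# Hypothetical sample: the first entry has a multi-line <idx:infl> block, the second has none.
SAMPLE = (
    '\t\t<idx:entry name="dic">\r\n'
    '\t\t\t\t<idx:orth>apple\r\n'
    '\t\t\t\t<idx:infl>\r\n'
    '\t\t\t\t\t<idx:iform value="apples"/>\r\n'
    '\t\t\t\t\t<idx:iform value="appled"/>\r\n'
    '\t\t\t\t</idx:infl>\r\n'
    '\t\t<idx:entry name="dic">\r\n'
    '\t\t\t\t<idx:orth>sheep\r\n'
)

ENTRY = re.compile(
    r'<idx:entry name="dic">\s*'
    r'<idx:orth>([^<]*)'                  # headword text, up to the next tag
    r'(<idx:infl>[\s\S]*?</idx:infl>)?'   # optional block; [\s\S] also matches line breaks
)

# group(2) is None when the optional block is absent.
results = [(m.group(1).strip(), m.group(2)) for m in ENTRY.finditer(SAMPLE)]
for orth, infl in results:
    print(orth, "->", repr(infl))
```

In Python the same effect is available with `.*?` plus the `re.DOTALL` flag; `[\s\S]` is just the portable spelling that works in engines without such a flag.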
wiki
  • OK, I thought you wanted to capture the "optional group" you have marked in green in your picture. This will simply match everything from the start to `</idx:infl>`. – Felix Glas Jul 21 '13 at 21:18