I'm working with regular expressions (C# flavor) on a large XML file. In more than one case, I've noticed that patterns meant to match fairly large strings do not match at all. For example, take the following string:

<p>
<?tex xxxxxx ?>
</p>
  <table-wrap position="float">
 <table>
 <tbody>
 <tr>
<td colspan="2">
<hr/>
</td>
</tr>
<tr>
<td>
<nlm.tabular>Patient</nlm.tabular>
</td>
<td>
<nlm.tabular>Patient Waiting Time</nlm.tabular>
</td>
</tr>
<tr><td
<nlm.tabular>1st patient in block <italic>B</italic>
<subscript>1</subscript>
</nlm.tabular>
</td>
<td>
<nlm.tabular>0</nlm.tabular>
</td>
</tr><tr>
<td>
<nlm.tabular>2nd patient in block <italic>B</italic><subscript>1</subscript>
</nlm.tabular>
</td>
<td>
<nlm.tabular>
<mml:math display="block">
<mml:msubsup>
<mml:mrow>
<mml:mi>S</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn fontstyle="italic">1</mml:mn>
</mml:mrow>
<mml:mrow>
<mml:mn fontstyle="italic">1</mml:mn></mml:mrow>
</mml:msubsup>
</mml:math></nlm.tabular>
</td>
</tr>
<tr>
<td>
<nlm.tabular>3rd patient in block <italic>B</italic>
<subscript>1</subscript>
</nlm.tabular>
</td>
<td>
<nlm.tabular<mml:math display="block">
<mml:msubsup>
<mml:mrow>
<mml:mi>S</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn fontstyle="italic">1</mml:mn>
</mml:mrow>
<mml:mrow><mml:mn fontstyle="italic">1</mml:mn>
</mml:mrow>
</mml:msubsup>
<mml:mo>+</mml:mo>
<mml:msubsup>
<mml:mrow>
<mml:mi>S</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn fontstyle="italic">2</mml:mn>
</mml:mrow>
<mml:mrow>
<mml:mn fontstyle="italic">1</mml:mn></mml:mrow></mml:msubsup></mml:math>
</nlm.tabular>
<?tex llllll?>
</td></tr>
<tr>
<td>
<nlm.tabular/>
</td>
<td>
<nlm.tabular/>
</td>
</tr>
<tr>
<td>
<nlm.tabular>
<mml:math display="block">
<mml:mo>⋮</mml:mo>
</mml:math>
</nlm.tabular>
</td>
<td>
<nlm.tabular>
<mml:math display="block">
<mml:mo>⋮</mml:mo></mml:math>
</nlm.tabular>
<?tex cccccccc?>
</td></tr>
<tr>
<td>
<nlm.tabular>
<italic>n</italic>
<subscript>1</subscript>th patient in block <italic>B</italic>
<subscript>1</subscript></nlm.tabular>
</td><td>
<nlm.tabular>
<mml:math display="block">
<mml:msubsup>
<mml:mrow>
<mml:mi>S</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn fontstyle="italic">1</mml:mn>
</mml:mrow><mml:mrow>
<mml:mn fontstyle="italic">1</mml:mn>
</mml:mrow></mml:msubsup>
<mml:mo>+</mml:mo>
<mml:msubsup>
<mml:mrow>
<mml:mi>S</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn fontstyle="italic">2</mml:mn>
</mml:mrow>
<mml:mrow>
<mml:mn fontstyle="italic">1</mml:mn>
</mml:mrow>
</mml:msubsup><mml:mo>+</mml:mo>
<mml:mo>…</mml:mo>
<mml:mo>+</mml:mo>
<mml:msubsup>
<mml:mrow>
<mml:mi>S</mml:mi></mml:mrow>
<mml:mrow>
<mml:msub>
<mml:mrow>
<mml:mi>n</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn fontstyle="italic">1</mml:mn>
<mml:mo>-</mml:mo>
<mml:mn fontstyle="italic">1</mml:mn>
</mml:mrow>
</mml:msub>
</mml:mrow>
<mml:mrow>
<mml:mn fontstyle="italic">1</mml:mn>
</mml:mrow>
</mml:msubsup>
</mml:math>
</nlm.tabular>
</td>
</tr>
<tr>
<td colspan="2">
<hr/>
</td>
</tr>
<tr>
<td>
<nlm.tabular>vvvvvvvvvvvvvv</nlm.tabular>
</td>
<td>
<nlm.tabular>2222222222</nlm.tabular>
</td>
</tr>
<tr>
<td>
<nlm.tabular>Patient</nlm.tabular>
</td>
<td>
<nlm.tabular>Patient Waiting Time</nlm.tabular>
</td>
</tr>
<tr>
<td>
<nlm.tabular>1st patient in block <italic>B</italic>
<subscript>2</subscript>
</nlm.tabular>
</td>
<td>
<nlm.paragraph>0&lt;?tex type="longcontinued-tabular" cols="xxx" width="yyy"?&gt;</nlm.paragraph>
</td>
</tr>
<tr>
<td>
<nlm.tabular>2nd patient in block <italic>B</italic>
<subscript>2</subscript>
</nlm.tabular>
</td>
<td>
<nlm.tabular>
<mml:math display="block">
<mml:msubsup>
<mml:mrow>
<mml:mi>S</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn fontstyle="italic">1</mml:mn>
</mml:mrow>
<mml:mrow>
<mml:mn fontstyle="italic">2</mml:mn>
</mml:mrow>
</mml:msubsup>
</mml:math>
</nlm.tabular></td>
</tr>
<tr>
<td>
<nlm.tabular>3rd patient in block<italic> B</italic>
<subscript>b</subscript>
</nlm.tabular>
</td>
<td>
<nlm.tabular>
<mml:math display="block">
<mml:msubsup>
<mml:mrow>
<mml:mi>S</mml:mi></mml:mrow><mml:mrow>
<mml:mn fontstyle="italic">1</mml:mn></mml:mrow><mml:mrow>
<mml:mi>b</mml:mi></mml:mrow></mml:msubsup><mml:mo>+</mml:mo>
<mml:msubsup>
<mml:mrow>
<mml:mi>S</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn fontstyle="italic">2</mml:mn>
</mml:mrow>
<mml:mrow>
<mml:mi>b</mml:mi>
</mml:mrow>
</mml:msubsup>
</mml:math>
</nlm.tabular>
</td>
</tr>
<tr>
<td rowspan="2">
<nlm.tabular>
<mml:math display="block">
<mml:mo>⋮</mml:mo>
</mml:math>
</nlm.tabular>
<nlm.tabular>
<italic>n</italic>
<subscript>b</subscript>th patient in block<italic> B</italic>
<subscript>b</subscript>
</nlm.tabular>
</td>
<td>
<nlm.tabular>
<mml:math display="block"><mml:mo>⋮</mml:mo>
</mml:math>
</nlm.tabular>
</td>
</tr>
<tr>
<td colspan="0">
<nlm.tabular/>
</td>
<td>
<nlm.tabular>
<mml:math display="block">
<mml:msubsup>
<mml:mrow>
<mml:mi>S</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn fontstyle="italic">1</mml:mn>
</mml:mrow>
<mml:mrow>
<mml:mi>b</mml:mi>
</mml:mrow>
</mml:msubsup>
<mml:mo>+</mml:mo>
<mml:msubsup>
<mml:mrow>
<mml:mi>S</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn fontstyle="italic">2</mml:mn>
</mml:mrow>
<mml:mrow>
<mml:mi>b</mml:mi>
</mml:mrow>
</mml:msubsup>
<mml:mo>+</mml:mo>
<mml:mo>…</mml:mo>
<mml:msubsup>
<mml:mrow>
<mml:mo>+</mml:mo>
<mml:mi>S</mml:mi></mml:mrow>
<mml:mrow>
<mml:msub>
<mml:mrow>
<mml:mi>n</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>b</mml:mi>
<mml:mo>-</mml:mo>
<mml:mn fontstyle="italic">1</mml:mn>
</mml:mrow>
</mml:msub>
</mml:mrow>
<mml:mrow>
<mml:mi>b</mml:mi>
</mml:mrow>
</mml:msubsup>
</mml:math>
</nlm.tabular>
</td>
</tr>
<tr>
<td colspan="2"><hr/>
</td>
</tr>
</tbody>
</table>
</table-wrap>

I tried to match it with:

MatchCollection mc = Regex.Matches(input, @"\<p\>\<\?tex\s*.*?\?\>\s*\<\/p\>\s*\<table\-wrap\s*.*?\>.*?\<\/table\-wrap\>");

This doesn't give any match at all. The same happens with other strings of a similar size.

So what could the issue be?

nhahtdh
  • 1
    The <center> cannot hold. http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags
    – Mark Broadhurst Sep 19 '12 at 08:58
  • 1
    That's a long regex. I would cut it down to something smaller; for instance, see if just "td" matches correctly. In situations like this it is usually a bug in your regex. And using regex to match HTML is usually a bad idea. – andy boot Sep 19 '12 at 08:59
  • Saint Gerbil, sorry, I can't see your point. – Sarah Mohammad Sep 19 '12 at 09:14
  • Andy boot, I'd rather do this too, but the business requires using it the way I mentioned above. – Sarah Mohammad Sep 19 '12 at 09:15
  • 1
    @user1682418: Frankly, "the business" is stupid and encourages brittle code. Take the initiative and use an XML parser. Doing this with regex is plain daft. – spender Sep 19 '12 at 09:22
  • 1
    I'd suggest reading up on the Chomsky hierarchy and then reading your question again (hint: regex deals with type 3 and XML is type 2). http://en.wikipedia.org/wiki/Chomsky_hierarchy – Mark Broadhurst Sep 19 '12 at 09:24
  • Saint Gerbil, I thought that the .NET Framework regex engine implements the non-deterministic type. – Sarah Mohammad Sep 19 '12 at 11:08

1 Answer

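Use this regex:

<p>\s*<\?tex\s*.*?\?\>\s*\<\/p\>\s*\<table\-wrap\s*.*?\>.*?\<\/table\-wrap\>

Your pattern left out the whitespace between the tags: there is no \s* between <p> and <?tex, although the sample has a line break there.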
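
For completeness, here is a minimal sketch of how that pattern could be applied in C#. It rests on two assumptions that are not stated in the answer. First, the input contains line breaks as in the sample, and in .NET the dot does not match \n unless RegexOptions.Singleline is set, so the sketch passes that option. Second, the backslashes before <, >, / and - are dropped because those characters need no escaping. The class and method names (TexTableMatcher, FindBlocks) are hypothetical and only serve the illustration.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical helper; the names are illustrative, not from the original post.
public static class TexTableMatcher
{
    // Same structure as the pattern in the answer, written without the
    // unnecessary backslashes. \s* allows whitespace between the tags.
    public const string Pattern =
        @"<p>\s*<\?tex\s*.*?\?>\s*</p>\s*<table-wrap\s*.*?>.*?</table-wrap>";

    // Assumption: the input spans several lines, so Singleline is required
    // to let '.' match '\n' inside the table-wrap element.
    private static readonly Regex BlockPattern =
        new Regex(Pattern, RegexOptions.Singleline);

    // Returns the text of every <p><?tex ...?></p> + <table-wrap> block found.
    public static List<string> FindBlocks(string input)
    {
        var blocks = new List<string>();
        foreach (Match m in BlockPattern.Matches(input))
        {
            blocks.Add(m.Value);
        }
        return blocks;
    }
}
```

Calling TexTableMatcher.FindBlocks(input) returns one string per matched block. Without RegexOptions.Singleline the same pattern cannot cross a line break inside <table-wrap>, which would explain why only "large" multi-line strings appear to fail.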

burning_LEGION