.net regex - strings that don't contain full stop on last list item

Question

I'm trying to use a .NET regex to identify strings in XML data that don't contain a full stop before the last tag. I don't have much experience with regex, and I'm not sure what I need to change, or why, to get the result I'm looking for.

Each line in the data ends with a carriage return and a line break.

A schema is used for the XML.

Example of good XML Data:

<randlist prefix="unorder">
    <item>abc</item>
    <item>abc</item>
    <item>abc.</item>
</randlist>

Example of bad XML data. The regex should match here, because no full stop precedes the last </item>:

<randlist prefix="unorder">
    <item>abc</item>
    <item>abc</item>
    <item>abc</item>
</randlist>

Regex pattern I tried, which didn't match the bad XML data (not tested on the good XML data):

^<randlist \w*=[\S\s]*\.*[^.]<\/item>[\n]*<\/randlist>$

Results using http://regexstorm.net/tester:

0 matches

Results using https://regex101.com/:

0 matches

In my opinion this question differs from the following one, because of the full stop and start-of-string criteria:

Regex for string not ending with given suffix

Explanation from regex101:

/^<randlist \w*=[\S\s]*\.*[^.]<\/item>[\n]*<\/randlist>$/gm
^ asserts position at start of a line
<randlist  matches the characters <randlist  literally (case sensitive)
\w* matches any word character (equal to [a-zA-Z0-9_])
* Quantifier — Matches between zero and unlimited times, as many times as possible, giving back as needed (greedy)
= matches the character = literally (case sensitive)
Match a single character present in the list below [\S\s]*
* Quantifier — Matches between zero and unlimited times, as many times as possible, giving back as needed (greedy)
\S matches any non-whitespace character (equal to [^\r\n\t\f\v ])
\s matches any whitespace character (equal to [\r\n\t\f\v ])
\.* matches the character . literally (case sensitive)
* Quantifier — Matches between zero and unlimited times, as many times as possible, giving back as needed (greedy)
Match a single character not present in the list below [^.]
. matches the character . literally (case sensitive)
< matches the character < literally (case sensitive)
\/ matches the character / literally (case sensitive)
item> matches the characters item> literally (case sensitive)
Match a single character present in the list below [\n]*
< matches the character < literally (case sensitive)
\/ matches the character / literally (case sensitive)
randlist> matches the characters randlist> literally (case sensitive)
$ asserts position at the end of a line
Global pattern flags
g modifier: global. All matches (don't return after first match)
m modifier: multi line. Causes ^ and $ to match the begin/end of each line (not only begin/end of string)

All you need is adding a star (*) to your regex: ^[\n]*<\/randlist>$ — Poul Bak, Jan 21 '20 at 19:31
This is a step in the right direction. I have posted an updated question on https://stackoverflow.com/questions/59858437/net-regex-strings-that-dont-contain-full-stop-preceding-last-item-atte — unseen_rider, Jan 22 '20 at 11:20

Zaelin Goodman · Answer 1 · 2020-01-22T16:51:38.023

@Silvanas is absolutely correct. You should not use regex for this problem; you should use some form of XML parser to read the data and find the lines with a ".". However, if for some horrible reason you MUST use regex, and if your data is structured exactly like your example, then the regex solution would be the following:

^\s+<item>[^<]*?(?<=\.)<\/item>$

If there ARE any matches with that regex, your XML is malformed. But again, this regex fails if the whitespace isn't correct, if there's anything else on the line, if the tags aren't <item>..</item>, and so on. You would be far, far better off not using regex for this problem unless you can absolutely guarantee that everything but the "." is going to be well-formed XML.
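As a rough illustration of the parser route recommended above, here is a minimal sketch in Python using the standard library's xml.etree.ElementTree. It is not the answerer's code; the function name lists_missing_final_stop and the GOOD/BAD sample strings are made up for this example, and the same idea should carry over to a .NET XML parser. It reports every <randlist> whose last <item> does not end with a full stop, which is the check the question asks for.

```python
import xml.etree.ElementTree as ET

# Sample data from the question, joined with CRLF line endings (as described there).
GOOD = "\r\n".join([
    '<randlist prefix="unorder">',
    "    <item>abc</item>",
    "    <item>abc</item>",
    "    <item>abc.</item>",
    "</randlist>",
])
BAD = GOOD.replace("abc.", "abc")


def lists_missing_final_stop(xml_text):
    """Return each <randlist> element whose last <item> text does not end with '.'.

    Hypothetical helper for illustration only.
    """
    root = ET.fromstring(xml_text)
    offenders = []
    # Element.iter() also yields the root itself when its tag matches.
    for randlist in root.iter("randlist"):
        items = randlist.findall("item")
        if not items:
            continue  # nothing to check in an empty list
        # itertext() also collects text inside any nested markup.
        last_text = "".join(items[-1].itertext()).rstrip()
        if not last_text.endswith("."):
            offenders.append(randlist)
    return offenders


print(len(lists_missing_final_stop(GOOD)))  # 0
print(len(lists_missing_final_stop(BAD)))   # 1
```

Because the parser handles whitespace, attributes and line endings itself, none of the layout caveats listed above apply to this version.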

EDIT: If the opening and closing tags are on the same line, but the element isn't necessarily named 'item' and may have attributes, try the following:

^\s+<([^<>\s]+)[^<>]*>[^<>]*?(?<=\.)<\/\1>$

Breakdown:
^           anchor to beginning of line
\s+         skip over any whitespace
<           found what looks like an opening tag
([^<>\s]+)  match the first word found after the "<", store in capture group 1
[^<>]*>     match whatever remains until the closing ">"
[^<>]*?     match all of the contents up until the next "<"
(?<=\.)     ensure the last character was a "."
<\/\1>      match a closing tag where the text after the / is the same as the first word of the opening tag (stored in capture group 1)
$           anchor to end of line

Make sure you have the Multiline regex option set; otherwise ^ and $ will only match at the beginning/end of the entire string. As before, any match with this regex means the XML is poorly formed on that line.
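To see how the generalized pattern behaves with the multiline option, here is a small sketch in Python rather than .NET (re.MULTILINE plays the role of RegexOptions.Multiline). Two things in it are assumptions that go beyond the answer. First, an optional \r is added before $, because in both engines $ in multiline mode matches only before \n and the question's data has carriage returns. Second, LAST_ITEM_WITHOUT_STOP is a hypothetical variant, not the answerer's pattern, that uses a negative lookbehind to flag a list whose final </item> is not preceded by a full stop. All names and sample strings are made up for the example.

```python
import re

# Sample data from the question, joined with CRLF line endings.
GOOD = "\r\n".join([
    '<randlist prefix="unorder">',
    "    <item>abc</item>",
    "    <item>abc</item>",
    "    <item>abc.</item>",
    "</randlist>",
])
BAD = GOOD.replace("abc.", "abc")

# The answer's generalized pattern, plus an assumed "\r?" so "$" still
# matches when lines end in CRLF.
ENDS_WITH_STOP = re.compile(
    r"^\s+<([^<>\s]+)[^<>]*>[^<>]*?(?<=\.)</\1>\r?$",
    re.MULTILINE,
)

# Hypothetical variant: the last </item> before </randlist> is NOT
# preceded by a full stop. No anchors, so no multiline flag is needed.
LAST_ITEM_WITHOUT_STOP = re.compile(r"(?<!\.)</item>\s*</randlist>")


def items_ending_with_stop(text):
    """Return the single-line elements whose content ends with '.'."""
    return [m.group(0).strip() for m in ENDS_WITH_STOP.finditer(text)]


print(items_ending_with_stop(GOOD))             # ['<item>abc.</item>']
print(items_ending_with_stop(BAD))              # []
print(bool(LAST_ITEM_WITHOUT_STOP.search(BAD)))   # True
print(bool(LAST_ITEM_WITHOUT_STOP.search(GOOD)))  # False
```

Without the multiline flag the first pattern finds nothing in either sample, since ^ would then only match at the very start of the text.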

The xml is well formed, it has to be that is the whole purpose of using a schema to control the structure and content — unseen_rider, Jan 22 '20 at 10:30
