
I am trying to use this regex:

^(\s+)<ProjectReference(.|\s)+?(Project2)</Name>(.|\s)+?</ProjectReference>

...to locate only this section:

    <ProjectReference Include="..\..\Project2\Project2.csproj">
      <Project>{6c2a7631-8b47-4ae9-a68f-f728666105b9}</Project>
      <Name>Project2</Name>
    </ProjectReference>

...in the below document:

what is causing this text up here to be selected??

    <ProjectReference Include="..\..\Project1\Project1\Project1.csproj">
      <Project>{714c6b26-c609-40a4-80a9-421bd842562d}</Project>
      <Name>Project1</Name>
    </ProjectReference>


  <ItemGroup>
    <ProjectReference Include="..\..\Project2\Project2.csproj">
      <Project>{6c2a7631-8b47-4ae9-a68f-f728666105b9}</Project>
      <Name>Project2</Name>
    </ProjectReference>
    <ProjectReference Include="..\..\Project3\Project3\Project3.csproj">
      <Project>{39860208-8146-429f-a1d1-5f8ed2fd7f5f}</Project>
      <Name>Project3</Name>
    </ProjectReference>
    <ProjectReference Include="..\..\Project4\Project4.csproj">
      <Project>{58144d60-19d9-4d11-8ae6-088e03ccf874}</Project>
      <Name>Project4</Name>
    </ProjectReference>
    <ProjectReference Include="..\..\Project5\Project5.csproj">
      <Project>{33baa509-ad24-4a72-a2fc-8f297e75e90d}</Project>
      <Name>Project5</Name>
    </ProjectReference>
  </ItemGroup>
  <PropertyGroup>
    <VisualStudioVersion Condition="'$(VisualStudioVersion)' == ''">10.0</VisualStudioVersion>
    <VSToolsPath Condition="'$(VSToolsPath)' == ''">$(MSBuildExtensionsPath32)\Microsoft\VisualStudio\v$(VisualStudioVersion)</VSToolsPath>
  </PropertyGroup>

In Notepad++, it appears to initially locate the match, but then it proceeds to match the entire document in a second match (so it's finding 2 matches total). I originally discovered this in my .NET app when my utility was replacing the entire contents of my project file with an empty string, effectively clearing the entire thing out.

I've spent over an hour toiling over this, so let's see if SE can figure it out.

Update: Though I've marked an answer that actually works, I ended up going with a not-so-magical approach to ensure that no rare regex quirks creep into my code later down the road as was the case recently.

^(\s+)<ProjectReference.+?({0})\.(csproj|vbproj).*\r\n.*\r\n\s+<Name>{0}</Name>\r\n\s*</ProjectReference>

...where {0} is the name of my project. While more verbose, this solution is less likely to bug out with excessive matches. I use RegexOptions.Multiline in my .NET app so that I can anchor to the beginning of a line.
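For anyone who wants to try the literal pattern outside .NET, here is a rough Python sketch of it. It assumes CRLF line endings (as the pattern does), uses `re.MULTILINE` as the counterpart of `RegexOptions.Multiline`, and fills in `{0}` with `str.format`. The helper name `find_reference` and the shortened sample document are made up for illustration.

```python
import re

# Shortened sample document, joined with CRLF because the pattern expects \r\n.
LINES = [
    r'  <ItemGroup>',
    r'    <ProjectReference Include="..\..\Project1\Project1\Project1.csproj">',
    r'      <Project>{714c6b26-c609-40a4-80a9-421bd842562d}</Project>',
    r'      <Name>Project1</Name>',
    r'    </ProjectReference>',
    r'    <ProjectReference Include="..\..\Project2\Project2.csproj">',
    r'      <Project>{6c2a7631-8b47-4ae9-a68f-f728666105b9}</Project>',
    r'      <Name>Project2</Name>',
    r'    </ProjectReference>',
    r'  </ItemGroup>',
]
DOC = "\r\n".join(LINES)

# The literal pattern from the update; {0} is the project name placeholder.
TEMPLATE = (
    r"^(\s+)<ProjectReference.+?({0})\.(csproj|vbproj).*\r\n"
    r".*\r\n\s+<Name>{0}</Name>\r\n\s*</ProjectReference>"
)


def find_reference(doc, name):
    """Hypothetical helper: return the match for one project reference, or None."""
    # re.escape guards against regex metacharacters in the project name.
    pattern = TEMPLATE.format(re.escape(name))
    return re.search(pattern, doc, re.MULTILINE)


found = find_reference(DOC, "Project2")
missing = find_reference(DOC, "Project9")
```

Because every line of the element is spelled out, the match cannot run past the closing tag of the element it started in. The cost is that the pattern depends on the exact line layout of the file.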

oscilatingcretin
  • This `(.|\r\n)+`. Greedy `.` will capture everything. – Boris the Spider May 13 '16 at 17:05
  • @BoristheSpider Oops, editing mistake while composing my question. I corrected it with a `?`, but it's still doing the same thing. I copied and pasted that regex directly out of my Notepad++ find window. – oscilatingcretin May 13 '16 at 17:12
  • It seems you want to extract the portion related to `project2`. Why don't you use an XPath expression or an XML parser? – Federico Piazza May 13 '16 at 17:15
  • @FedericoPiazza I guess I could. I am trying to replace a project reference with a DLL reference and regex is just the first method I could think of. I chose it because I am comfortable with regex and didn't have to learn anything new. – oscilatingcretin May 13 '16 at 17:18
  • @oscilatingcretin, ok. So, to be sure you only want the specific section you put above related to `Project2\Project2.csproj` ? – Federico Piazza May 13 '16 at 17:20
  • possible [XY problem](http://meta.stackexchange.com/questions/66377/what-is-the-xy-problem)? It is almost never recommended to parse XML-like documents with regex. – R Nar May 13 '16 at 17:40

2 Answers


I think the best approach would be to use an XPath expression or an XML parser.
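As a rough illustration of the parser route, here is a minimal Python sketch using the standard library's `xml.etree.ElementTree`. The sample document and the helper name `remove_project_reference` are assumptions for the example. A real .csproj usually declares an MSBuild XML namespace on its root element, and the element names below would then need that namespace prefix.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified project file without an XML namespace.
CSPROJ = r"""<Project>
  <ItemGroup>
    <ProjectReference Include="..\..\Project1\Project1\Project1.csproj">
      <Project>{714c6b26-c609-40a4-80a9-421bd842562d}</Project>
      <Name>Project1</Name>
    </ProjectReference>
    <ProjectReference Include="..\..\Project2\Project2.csproj">
      <Project>{6c2a7631-8b47-4ae9-a68f-f728666105b9}</Project>
      <Name>Project2</Name>
    </ProjectReference>
  </ItemGroup>
</Project>"""


def remove_project_reference(xml_text, name):
    """Return xml_text without the <ProjectReference> whose <Name> equals name."""
    root = ET.fromstring(xml_text)
    for group in root.iter("ItemGroup"):
        # Copy the list first so removing elements does not disturb iteration.
        for ref in list(group.findall("ProjectReference")):
            if ref.findtext("Name") == name:
                group.remove(ref)
    return ET.tostring(root, encoding="unicode")


result = remove_project_reference(CSPROJ, "Project2")
```

The parser compares the `<Name>` text of each element directly, so there is no way for a match to spill over into a neighbouring reference.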

However, as you stated in your comment, if you want to capture that specific portion using regex, you can use this:

(<ProjectReference.*?Project2[\s\S]*?</ProjectReference>)

Working demo

Match information

MATCH 1
1.  [209-384]   `<ProjectReference Include="..\..\Project2\Project2.csproj">
      <Project>{6c2a7631-8b47-4ae9-a68f-f728666105b9}</Project>
      <Name>Project2</Name>
    </ProjectReference>`

Besides regex101, I also tested it in Sublime Text to confirm it works. Notepad++, however, has a poor regex engine and usually chokes on tricks like [\s\S]*?.

As for why your regex is failing: it is not actually failing. Your pattern allows that overly long match, even with the lazy quantifier, because of your (.|\s) alternation:

^(\s+)<ProjectReference(.|\s)+?(Project2)</Name>(.|\s)+?</ProjectReference>
                          ^--- HERE

If you check the Regex101 explanation, you can see:

2nd Capturing group (.|\s)+?
  Quantifier: +? Between one and unlimited times, as few times as possible, expanding as needed [lazy]
  Note: A repeated capturing group will only capture the last iteration. Put a capturing group around the repeated group to capture all iterations or use a non-capturing group instead if you're not interested in the data
  1st Alternative: .
    . matches any character (except newline)
  2nd Alternative: \s
    \s match any white space character [\r\n\t\f ]
Federico Piazza
  • You're using the `Project2` in the `Include` attribute as the sentinel, while the OP is using the one in the `<Name>` element. That makes the task simpler, but can you be sure it's valid? – Alan Moore May 13 '16 at 19:25
  • @AlanMoore, good eye, I didn't see that. I based it on the OP's goal of fetching that section. Let's see what the OP says; maybe using the `Include` attribute as the sentinel is good enough. – Federico Piazza May 13 '16 at 19:32
  • @AlanMoore Your solution works in both Notepad++ and my .NET app. I like the `[\s\S]` trick a lot. In the future, I will probably not try all this regex wizardry and just take a more literal approach which I will post at the end of my question. – oscilatingcretin May 17 '16 at 12:48
  • @oscilatingcretin glad to help. `[\s\S]` is a very well-known trick to match everything without using the `s` flag. It is commonly used in cases like yours, where `.` doesn't match the newline but `[\s\S]` does. – Federico Piazza May 17 '16 at 14:47
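The difference between `.` and `[\s\S]` mentioned in the comments can be checked with a few lines of Python. This is only a sketch: the `SNIPPET` string is invented for the demonstration, and `re.DOTALL` is Python's name for the `s` flag.

```python
import re

# Invented sample with newlines inside the element.
SNIPPET = "<Name>\nProject2\n</Name>"

# Without the s flag, . stops at the newline, so there is no match.
dot_match = re.search(r"<Name>.*?</Name>", SNIPPET)

# [\s\S] matches any character, newlines included, without any flag.
class_match = re.search(r"<Name>[\s\S]*?</Name>", SNIPPET)

# With re.DOTALL (the s flag) the plain dot behaves the same way.
flag_match = re.search(r"<Name>.*?</Name>", SNIPPET, re.DOTALL)
```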

First, never use (.|\s) to match everything-including-newlines; it's a freeze-up waiting to happen (see this answer for more info). The search dialog in Notepad++ includes a check box for that purpose, labelled . matches newline.

Second, you should not be getting that result, no matter what. I've reproduced it in a local copy of Notepad++, and it looks like a bug. Maybe the regex is freezing, and NPP is failing to catch the error. At any rate, you should be getting only one match, and that's what happens when I select . matches newline and change your regex to this:

^\h*<ProjectReference.*?Project2</Name>.*?</ProjectReference>

However, it still matches too much, encompassing both the Project1 and Project2 elements. That's because non-greedy quantifiers only affect where matching ends, not where it begins. You need to use something more specific to make sure the match doesn't extend beyond the element where it started. I think this is the surest way to do that:

^\h*<ProjectReference(?:(?!</?ProjectReference).)*Project2</Name>.*?</ProjectReference>

The idea is that the dot is allowed to match any character (including newlines), unless it's the first character of the sequence <ProjectReference or </ProjectReference. So, once it starts matching the opening <ProjectReference> tag, it can match anything except another such tag (opening or closing), until it finds the sentinel string (Project2).
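To see the difference between the two patterns side by side, here is a small Python sketch run against a shortened copy of the document. Two substitutions are assumptions of the sketch: `re.DOTALL` stands in for Notepad++'s ". matches newline" option, and `[ \t]*` replaces `\h*`, which Python's `re` module does not support.

```python
import re

# Shortened copy of the sample document.
DOC = r"""
    <ProjectReference Include="..\..\Project1\Project1\Project1.csproj">
      <Project>{714c6b26-c609-40a4-80a9-421bd842562d}</Project>
      <Name>Project1</Name>
    </ProjectReference>
    <ProjectReference Include="..\..\Project2\Project2.csproj">
      <Project>{6c2a7631-8b47-4ae9-a68f-f728666105b9}</Project>
      <Name>Project2</Name>
    </ProjectReference>
    <ProjectReference Include="..\..\Project3\Project3\Project3.csproj">
      <Project>{39860208-8146-429f-a1d1-5f8ed2fd7f5f}</Project>
      <Name>Project3</Name>
    </ProjectReference>
"""

FLAGS = re.MULTILINE | re.DOTALL

# Lazy version: starts at the first <ProjectReference it can, then stretches.
LAZY = re.compile(
    r"^[ \t]*<ProjectReference.*?Project2</Name>.*?</ProjectReference>",
    FLAGS,
)

# Tempered version: the dot may not step onto another ProjectReference tag.
TEMPERED = re.compile(
    r"^[ \t]*<ProjectReference(?:(?!</?ProjectReference).)*"
    r"Project2</Name>.*?</ProjectReference>",
    FLAGS,
)

lazy_hits = [m.group(0) for m in LAZY.finditer(DOC)]
tempered_hits = [m.group(0) for m in TEMPERED.finditer(DOC)]
```

In this sketch the lazy pattern returns a single match that begins in the Project1 element and ends at the close of Project2, while the tempered pattern returns only the Project2 element.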

UPDATE: This is definitely a bug in Notepad++. I've done some more testing myself, and found other reports to confirm it (here and here). Those guys get pretty creative in their attempts to trigger the bug, but it boils down to this: if the regex takes too long to match, NPP incorrectly selects everything.

Alan Moore
  • I tried your second regex in Notepad++ and it works, but I have to have `. matches newline` enabled. This issue was originally discovered in my .NET app, so I need a solution that works there. The .NET native regex options only support `RegexOptions.Multiline` which isn't the same as Notepad++'s option. I upvoted your answer, though. I've found a solution that takes a more literal approach rather than trying to do all this regex sorcery to match magical patterns. I will post it shortly – oscilatingcretin May 17 '16 at 12:29
  • Sorry, I thought Notepad++ was your target flavor. In .NET, you have to use `Multiline` mode to make `^` match at the beginning of a line (NPP is *always* in multiline mode), and `Singleline` to make `.` match newlines. Also, `\h` (horizontal whitespace) is not supported in .NET, so either use `[ \t]*`, or go back to using `\s*`. Or drop it altogether; unless you're trying to normalize leading whitespace, that part isn't necessary. – Alan Moore May 17 '16 at 18:20
  • The problem with SingleLine mode in .NET is that, according to my tests, it literally treats the entire string as a single-line string, so you can't use `^` to anchor to lines in the middle of the string. – oscilatingcretin May 18 '16 at 10:55
  • I think you must be misinterpreting your test results. Singleline only affects the dot, and Multiline only affects the anchors (`^` and `$`). There's no overlap between the two, and (despite what their names imply) they are not mutually exclusive. – Alan Moore May 18 '16 at 17:14
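The independence of the two modes can be illustrated with a short Python sketch. Python is used here only as a stand-in: `re.MULTILINE` and `re.DOTALL` play the roles of .NET's `RegexOptions.Multiline` and `RegexOptions.Singleline`, and the assumption is that the two engines behave alike on this point. The `TEXT` sample is invented.

```python
import re

# Invented sample: the target starts on the second line and spans two lines.
TEXT = "first\n  second <a>\nthird</a>\n"
PATTERN = r"^[ \t]*second <a>.*</a>"

# Multiline only: ^ matches at each line start, but . cannot cross the newline.
multiline_only = re.search(PATTERN, TEXT, re.MULTILINE)

# Singleline (DOTALL) only: . crosses newlines, but ^ matches only at position 0.
dotall_only = re.search(PATTERN, TEXT, re.DOTALL)

# Both together: ^ anchors mid-string and . crosses newlines.
both = re.search(PATTERN, TEXT, re.MULTILINE | re.DOTALL)
```

Only the combination finds the two-line match. That is consistent with the description above: one option changes the anchors, the other changes the dot, and they can be combined.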