0

I noticed that when I run a RegEx expression, it has inconsistent index positions of results when run in another machine. Am I missing something here...

Give this sql statement:

var sqlStatementRaw = $@"
--GO inline comment
CREATE PROC table1
AS
    SELECT 1;
GO

/*
GO in inline comment block
*/

CREATE PROC table2
AS
    SELECT 1;
GO

/* multiline comment block
GO
*/

CREATE PROC table3
AS
    SELECT 1;
";

And this pieace of code:

var resultList = new List<CommentAnalyzerResult>();

//all comments
//https://stackoverflow.com/questions/7690380/regular-expression-to-match-all-comments-in-a-t-sql-script/33947706#33947706
var regex = new Regex(@"/\*(?>(?:(?!\*/|/\*).)*)(?>(?:/\*(?>(?:(?!\*/|/\*).)*)\*/(?>(?:(?!\*/|/\*).)*))*).*?\*/|--.*?\r?[\n]", RegexOptions.Singleline | RegexOptions.CultureInvariant);

var match = regex.Match(sqlStatementRaw);
while (match.Success)
{
    var commentBlock = new CommentAnalyzerResult { Text = match.Value, Start = match.Index, End = match.Index + match.Length };
    resultList.Add(commentBlock);

    match = match.NextMatch();
}

Machine 1 results:

## results from my win10/x64 machine
Environment.OSVersion.Platform: Win32NT, Microsoft Windows NT 10.0.18362.0
Environment.NewLine.byteCount: 2

sqlStatementRaw.byteCount: 251

commentText.byteCount: 21
startPosition: 2, stopPosition: 23

commentText.byteCount: 34
startPosition: 75, stopPosition: 109

commentText.byteCount: 34
startPosition: 165, stopPosition: 199

Machine 2 results:

## results in AppVeyor Agents - Previous Visual Studio 2019 Image
Environment.OSVersion.Platform: Win32NT, Microsoft Windows NT 10.0.17763.0
Environment.NewLine.byteCount: 2

sqlStatementRaw.byteCount: 228

commentText.byteCount: 20
startPosition: 1, stopPosition: 21

commentText.byteCount: 32
startPosition: 68, stopPosition: 100

commentText.byteCount: 32
startPosition: 149, stopPosition: 181


EDIT:
- One machine is my Windows 10 dev machine and other is AppVeyor Build Image with VS2019
- Using .NET Core 3.0, both machines are 64-bit. The tests are run via MsTest drivers.

EDIT:
- Added more debug info: OS info, byte count

rdagumampan
  • 459
  • 4
  • 16
  • @Barns No. One is my Windows 10 dev machine and other is AppVeyor Build Image with VS2019. – rdagumampan May 16 '20 at 18:47
  • 1
    .NET Core 3.0, both machines are 64-bit. The tests are run via MsTest drivers. It was a unit tests so the sqlStatementRaw is living within the unit test code.Maybe this would help https://github.com/rdagumampan/yuniql/blob/39508171748ce698cfd6ee4abd51ed406200a116/yuniql-tests/platform-tests/Platforms/SqlServer/SqlServerTestDataService.cs#L134 – rdagumampan May 16 '20 at 18:52
  • 1
    Here is my guess--one of the machines my be interpreting the line breaks differently. Remember `\n is used for Unix systems (including Linux, and OSX). \r\n is mainly used on Windows. \r is used on really old Macs. – Barns May 16 '20 at 19:01
  • The length of second match should be 34 not 32 unless crlf are converted to just lf. The start should be zero based string. Since 1 would indicate just lf, 68 is correct if crlf. Does not jive. Don't save the end position, just the start and length. But, its all about crlf translation apparently, but conflicts as seen. The length must be discerned as it may be in error as welll –  May 16 '20 at 19:13
  • Must say the logic of comments here are a confusion since the C++ model is `/\*.*?\*/` –  May 16 '20 at 19:20
  • Its likely to be Winx64 image as it has visual studio 2019 installed and sql server service. – rdagumampan May 16 '20 at 19:40
  • I have now added debug info on OS info and bytecount for both NewLine and sqlStatementRaw. I think we can deduct now the Linux hypothesis. – rdagumampan May 17 '20 at 06:41
  • From the tone of the title is sounds like you suspect C# regex 'index' member. Been around for a long time to be a bug. Could just be a string length bug. CRLF translation to/from is done by code apps when they call underlying OS open modes. It's selectable, but as I stated previously, in your original output there is an inconsistent result in the first group that should not be like that. But you didn't seem to care about that. –  May 17 '20 at 19:58
  • I don't suspect this to be a bug in RegEx as I have strong feeling this is machine related issue I just couldn't find possible explanation as both runs on Windows Machine, same Newline length, exactly the same string input... as you guys suggested, its certainly not the same as the byte count differs between machines. UTF8.GetBytes() of the raw string input returns different bytecount so its natural that the expression yields different positions also... – rdagumampan May 17 '20 at 20:34
  • I still couldn't explain why the position starts at 1 and not 0 though. – rdagumampan May 17 '20 at 20:35

2 Answers2

1

I suspect the string literal with hard line-breaks is being interpreted differently in the two environments. Essentially adding an index position to the result for each line-break.

\n used for Unix systems (including Linux)
\r\n on Windows
\r is used on some Macs

You might be able to check this by performing this line of code on each machine and comparing the resulting values:

var byteCount = System.Text.Encoding.UTF8.GetByteCount(System.Environment.NewLine);


My results:

Mac        ==> 1
Windows 10 ==> 2

EDIT
Or maybe substitute "System.Environment.NewLine" in the above code and use your "sqlStatementRaw" and compare length.

Barns
  • 4,850
  • 3
  • 17
  • 31
  • I have strong feelingits really about the NewLine. Maybe ill try to prepare the test sql statement with Environment.NewLine also. – rdagumampan May 16 '20 at 19:43
  • Preparing a statement using `Environment.NewLine` will not help. You need to analyze how each machine interprets the `NewLine` which is why you need to run `var byteCount = System.Text.Encoding.UTF8.GetByteCount(System.Environment.NewLine);` OR `var byteCount = System.Text.Encoding.UTF8.GetByteCount(sqlStatementRaw);` and compare lengths. – Barns May 16 '20 at 19:49
  • I have now added debug info on OS info and bytecount for both NewLine and sqlStatementRaw. I think we can deduct now the Linux hypothesis. – rdagumampan May 17 '20 at 06:40
  • I did not marked this as an answer but it certainly led me through to get he answer. Thanks mate and all. – rdagumampan May 17 '20 at 20:38
0

Lost couple of hair on this but finally made it work.

"Git actually replaces CR LF with LF on commit, but tests do not know about it and use wrong, modified versions of files when running on AppVeyor." https://github.com/MikhailTymchukDX

I verified this again by showing here the SqlServerTestDataService.cs that produces the the raw sql statement. One the local version via git clone and another one I downloaded directly from github raw file.

From git clone

enter image description here

File directly downloaded from github

enter image description here

So I had to modify my AppVeyor setup and all tests passed.

enter image description here

Reference: https://help.appveyor.com/discussions/problems/3954-test-failure-cause-by-different-newlines https://github.com/appveyor/ci/issues/642 SourceTree App says uncommitted changes even for newly-cloned repository - what could be wrong?

rdagumampan
  • 459
  • 4
  • 16