Regular expression occasionally gives wrong answer

Question

I am trying to match chess notation. I have a C# regular expression like this:

"(?:[PNBRQK]?[a-h]?[1-8]?x?[a-h][1-8](?:\=[PNBRQK])?|O(-?O){1,2})[\+#]?(\s*[\!\?]+)?";

[I would not mind a C# YACC lexer for Short Algebraic Notation (SAN), but I am using regex for now:]

<move> ::= <move number><move descriptor>
<move number> ::= <digit>[<digit>...]{'.' | '...'}

<move descriptor> ::= <from square><to square>[<promoted to>]
<square>        ::= <file letter><rank number>
<file letter>   ::= 'a'|'b'|'c'|'d'|'e'|'f'|'g'|'h'
<rank number>   ::= '1'|'2'|'3'|'4'|'5'|'6'|'7'|'8'
<promoted to>   ::= 'q'|'r'|'b'|'n'

<Piece symbol> ::=  'P' | 'N' | 'B' | 'R' | 'Q' | 'K'

<SAN move descriptor piece moves>   ::= <Piece symbol>[<from file>|<from rank>|<from 
square>]['x']<to square>
<SAN move descriptor pawn captures> ::= <from file>[<from rank>] 'x' <to square>[<promoted to>]
<SAN move descriptor pawn push>     ::= <to square>[<promoted to>]

Sometimes the above regex matches too much, for example these first few moves are matched this way (I remove move numbers before matching):

1e4d5 
2exd5Nf6 
3d4Qxd5
4c4Qd6
5Nf3c5
6Be3cxd4
7Nxd4a6
8Be2e5
9Nf3Nc6
10O-OQxd1
11Rxd1Be7
12Nc3Be6
13Nd5Bd8
14Nb6Rb8
15Ng5Bf5
16Bf3e4
17Be2h6
18Nh3Bxh3
19gxh3O-O
20Nd7Nxd7
21Rxd7Bf6
22Bf4Rfd8
23Rad1Be5
24Bxe5Rxd7
25Rxd7Nxe5
26Re7Nf3+
27Kg2f5
28Bxf3exf3+
29Kxf3Rc8
30b3b5
31cxb5axb5
32Rb7Ra8
33Rxb5Rxa2
34Rxf5Rb2
35Rb5Kf7
36Rb7+Kf6
37h4g5
38h5Ke5
39Rb6Kd5
40Rxh6Rxb3+
41Kg4Rb2

Matches to:

    e4d5
    exd5
    Nf6
    d4
    Qxd5
    c4
    Qd6
    Nf3c5

Result is supposed to be (the code adds a period after the move # but not necessary):

1e4 d5 
2exd5 Nf6 
3d4 Qxd5
4c4 Qd6
5Nf3 c5
6Be3 cxd4
7Nxd4 a6
8Be2 e5
9Nf3 Nc6
10O-O Qxd1
11Rxd1 Be7
12Nc3 Be6
13Nd5 Bd8
14Nb6 Rb8
15Ng5 Bf5
16Bf3 e4
17Be2 h6
18Nh3 Bxh3
19gxh3 O-O
20Nd7 Nxd7
21Rxd7 Bf6
22Bf4 Rfd8
23Rad1 Be5
24Bxe5 Rxd7
25Rxd7 Nxe5
26Re7 Nf3+
27Kg2 f5
28Bxf3 exf3+
29Kxf3 Rc8
30b3 b5
31cxb5 axb5
32Rb7 Ra8
33Rxb5 Rxa2
34Rxf5 Rb2
35Rb5 Kf7
36Rb7+ Kf6
37h4 g5
38h5 Ke5
39Rb6 Kd5
40Rxh6 Rxb3+
41Kg4 Rb2

Notice that the first and fifth moves are wrong, since it matches both white's and black's move.

What is the modification to my regex to get it to work so that it always just matches one sides move?

Here is the code

using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using System.Threading.Tasks;
using System.Text.RegularExpressions;

namespace ChessPGNParserConsoleApplication
{
    class Program
    {
        static void Main(string[] args)
        {
            string regexStr = @"(?:[PNBRQK]|[a-h][1-8]?x)?[a-h][1-8]|(O(-?O){1,2})[\+#]?(\s*[\!\?]+)?";
            //string regexStr = @"(?:[PNBRQK]|[a-h][1-8]?x)?[a-h][1-8](?:\=[PNBRQK])?(O(-?O){1,2})[\+#]?(\s*[\!\?]+)?";

            //string regexStr =   @"(?:[PNBRQK]?[a-h]?[1-8]?x?[a-h][1-8](?:\=[PNBRQK])?|O(-?O){1,2})[\+#]?(\s*[\!\?]+)?";
            //string regexStr =   @"(?:[PNBRQK]?[a-h]?[1-8]?x?[a-h][1-8](?:\=[PNBRQK])?|O(-?O){1,2})[\+#]?(\s*[\!\?]+)?";
            //string regexStr = @"(?:[PNBRQK]?[a-h]?[1-8]?x?[a-h][1-8])";

            string startsDigitRegexStr = @"^\d*";

            Regex regexpr = new Regex(regexStr);
            Regex regexprDigit = new Regex(startsDigitRegexStr);

            // Read the file and display it line by line.
            System.IO.StreamReader file = new System.IO.StreamReader(@"C:\Users\idf\Documents\My Chess Database\chessgame.txt");

            string replacement = "";

            int moveNumber = 1;

            string line;
            while (null != (line = file.ReadLine()))
            {
                MatchCollection mcDigit = regexprDigit.Matches(line);
                foreach (Match m in mcDigit)
                {
                    line = regexprDigit.Replace(line, replacement);

                    //Console.WriteLine(m);
                }

                //Console.WriteLine(line);

                MatchCollection mc = regexpr.Matches(line);

                int twoMoves = 0;
                Console.Write(moveNumber.ToString() + ". ");

                foreach (Match m in mc)
                {
                    Console.Write(m + " ");

                    if(1 == twoMoves++)
                        Console.WriteLine();
                }

                moveNumber++;
            }

            Console.ReadLine();
        }
    }
}

What is the result supposed to be? Not all of us read chess moves every day. — Kendall Frey, Aug 08 '14 at 19:07

welegan · Answer 1 · 2014-08-08T18:55:34.563

0

Untested, but try this one:

(?:[PNBRQK]|[a-h][1-8]?x)?[a-h][1-8](?:\=[PNBRQK])?|O(-?O){1,2})[\+#]?(\s*[\!\?]+)?

Reasoning:

[PNBRQK]|[a-h][1-8]?x

Match [PNBRQK]x or [a-h]x or [a-h][1-8]x at most once, but require the x at the end of the prefix. This means the first half of e4d5 (e4) will not match the [a-h][1-8] portion of this group because we are requiring the ending x character.

[a-h][1-8]

Match the contents of the actual move without the prefix. This will match e4 and only e4, leaving d5 from the first line to be interpreted as the next move.

Your original regex makes too liberal a use of the ? operator before the x character, so the [a-h]?[1-8]?x? matched e4 then the [a-h][1-8] matched d5, resulting in the incorrect output of the first turn. Lastly, I have no idea if the stuff after [a-h][1-8] is necessary, but I left it in there since I don't know the details of chess notation.

edited Aug 08 '14 at 18:55

answered Aug 08 '14 at 18:32

welegan

3,013
3
15
20

I can't compile string regexStr = @"(?:(?:[PNBRQK]|[a-h][1-8]?)x)?[a-h][1-8](?:\=[PNBRQK])?|O(-?O){1,2})[\+#]?(\s*[\!\?]+)?"; Regex regexpr = new Regex(regexStr); The exception is "too many )" which is strange because they match. – Ivan Aug 08 '14 at 18:44
I removed the inner parentheses, upon second reading, they may not be necessary. – welegan Aug 08 '14 at 18:57
I have included an entire game. All proposed solutions solve one problem, but introduce others. – Ivan Aug 08 '14 at 22:36

score 0 · Answer 2 · answered Aug 08 '14 at 18:33

0

For any practical use, you will need long notation (e.g. to resolve ambiguity between Nb1d2 and Nf3d2).

But why not reuse:

https://chessprogramming.wikispaces.com/Algebraic+Chess+Notation

answered Aug 08 '14 at 18:33

Pieter21

1,765
1
10
22

score 0 · Answer 3 · edited May 23 '17 at 10:33

0

Please for our understanding (and yours in the future, if you need to read or document it in the near future), break down your expression in smaller pieces. You can have the regex builder do the smart scan table optimization.

You could also build up a few unit tests to test your regex scanner.

You can 'OR' the results later without having to worry about duplicates too much.

Like in the following link: Combine Regexp

edited May 23 '17 at 10:33

Community

1
1

answered Aug 08 '14 at 21:53

Pieter21

1,765
1
10
22

When I read to complete match, I think you should use separate scanners anyway. You'd have some extra validation points easily to check if pawn move cxe6 (which would be valid in your regexp, but invalid in practice.) Also ambiguity resolving for Rooks Rfd8 and kNight (Nbd2) can become difficult otherwise. Also, from the initial regexp, you could try to also match both moves and the end of line character ($), so the lexer knows it should find two 'https://chessprogramming.wikispaces.com/Ply' on the same move. – Pieter21 Aug 09 '14 at 07:36

Regular expression occasionally gives wrong answer

3 Answers3