
There are several references (and examples) on the Internet on the use of the Unicode combining mark \u0300 for the a-grave character, à, that specify the Unicode code-point pair \u0061 \u0300. Why does this not work for me? What am I missing?

\u00e0\u00e0(\u0061\u0300)?à
matches ààà

\u00e0\u00e0\u0061\u0300
does not match ààà
Allan
TonyR
  • Have a look at this link: https://books.google.co.jp/books?id=6k7IfACN_P8C&pg=PA59&lpg=PA59&dq=%5Cu0061%5Cu0300&source=bl&ots=CqFhcOvp0K&sig=hKd1O51YH-U-KD9nhF39juQnNGc&hl=fr&sa=X&ved=0ahUKEwibtNXZ88zZAhWRwYMKHUBdAWgQ6AEIKzAA#v=onepage&q=%5Cu0061%5Cu0300&f=false (book: **Regular Expressions Cookbook** from O'Reilly) and read pages 58 and 59 for a very good explanation of it. – Allan Mar 02 '18 at 05:46
  • How do you test the string and regex? Please post the code. – Wiktor Stribiżew Mar 02 '18 at 08:12
  • There are two ways to do this: http://unicode.org/reports/tr15/ – Hans Passant Mar 02 '18 at 08:35
  • https://stackoverflow.com/questions/16467479/normalizing-unicode – Hans Passant Mar 02 '18 at 08:37
  • Allan, Hans: I read the Cookbook and, for test purposes, duplicated the code, which, for me, does not function as explained. The methodology is clear. – TonyR Mar 02 '18 at 18:42
  • Wiktor: Regex Storm (.NET). Pattern: \u00e0\u00e0(\u0061\u0300)?à Input: ààà (3 matches, not 4 as expected!) – TonyR Mar 02 '18 at 18:47
  • Hans: The problem with normalising Unicode is that I cannot call functions in my "native" regex environment. The two authoritative regex books, "Mastering Regular Expressions" and "Regular Expressions Cookbook", imply that the use of "combining marks" is possible, and even provide code snippets. – TonyR Mar 04 '18 at 08:24
  • This problem is caused by the character "displayed" as à having three possible encodings in a file: the single byte \xe0 (Latin-1, not strictly ASCII), the precomposed code point \u00e0, and the decomposed pair \u0061\u0300. A regex written with \xe0, \u00e0 or a literal precomposed à matches only the first two encodings. A regex that has to process strings that may contain "combining marks" must match both forms, e.g. with \u00e0|\u0061\u0300. – TonyR Mar 06 '18 at 05:19
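
For anyone who wants to reproduce the conclusion in the last comment, here is a minimal sketch in Python. The original tests were run on a .NET regex tester, so Python is a substitution, and all variable names (`strict`, `tolerant`, `A_GRAVE`, `mixed`, `all_precomposed`) are made up for illustration. It assumes the input ààà was typed as three precomposed \u00e0 characters, which would explain why the pattern ending in \u0061\u0300 found nothing.

```python
import re
import unicodedata

PRECOMPOSED = "\u00e0"        # à as one code point
DECOMPOSED = "\u0061\u0300"   # a followed by COMBINING GRAVE ACCENT

# Second pattern from the question: the last à must be the decomposed pair.
strict = re.compile("\u00e0\u00e0\u0061\u0300")

# Alternation that accepts either form of à at each of the three positions.
A_GRAVE = "(?:\u00e0|\u0061\u0300)"
tolerant = re.compile(A_GRAVE * 3)

all_precomposed = PRECOMPOSED * 3          # assumed form of the typed input
mixed = PRECOMPOSED * 2 + DECOMPOSED       # looks identical on screen

print(strict.fullmatch(all_precomposed) is not None)    # False
print(strict.fullmatch(mixed) is not None)              # True
print(tolerant.fullmatch(all_precomposed) is not None)  # True
print(tolerant.fullmatch(mixed) is not None)            # True

# Where calling a function is possible, NFC normalization folds the
# decomposed pair into the precomposed code point before matching.
print(unicodedata.normalize("NFC", mixed) == all_precomposed)  # True
```

The alternation is the only option in an environment where no normalization function can be called. Where one can, normalizing the input to NFC first lets the simpler precomposed-only pattern work.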
