I'm working on an application in Portuguese, so my regular expression must handle the following characters (áàâãéèêíïóôõöúçñ). So far I have `/(\b[a-z])/g`, but it treats the letter after one of those special characters as the start of a new word. I'm doing this in JavaScript.

Example:

Input: rua são luiz
Current output: Rua SãO Luiz
Desired output: Rua São Luiz

t.niese
nicolasLima
  • You probably need to specify what language you are using, since there are differences in all Regular Expression engines – Codebling Apr 15 '20 at 18:24
  • Thanks for the heads-up! This is my first post here and I didn't know regex differed from language to language. – nicolasLima Apr 15 '20 at 18:34
  • Does this answer your question? [Concrete Javascript Regex for Accented Characters (Diacritics)](https://stackoverflow.com/questions/20690499/concrete-javascript-regex-for-accented-characters-diacritics) – Hamms Apr 15 '20 at 18:44
  • You can use the regular expression `(?<![:alpha:])([:lower:])` to save the first character of each word that is a Unicode lower-case letter to capture group 1 and then use a backreference `\1` to convert those letters to upper-case. [Demo](https://regex101.com/r/vn9Evu/2/). `(?<![:alpha:])` is a *negative lookbehind*, meaning that the character matching `[:lower:]` cannot be preceded by a (Unicode) letter. – Cary Swoveland Apr 15 '20 at 19:26

1 Answer

The problem with that regular expression is the \b. It is defined in terms of word characters (\w), which in JavaScript do not include accented characters. The ã in your example is therefore a non-word character, so the engine sees a word boundary between the ã and the o that follows it. You can read more about character classes and assertions here: https://developer.mozilla.org/en-US/docs/Web/JavaScript/Guide/Regular_Expressions
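To see the boundary behaviour for yourself, here is a minimal sketch that runs the pattern from the question against the example input. The variable names are only illustrative, and it assumes the ã is stored as a single precomposed character.

```javascript
// The pattern from the question: a word boundary followed by an ASCII lower-case letter.
const original = /(\b[a-z])/g;

// In JavaScript, \w only covers [A-Za-z0-9_], so "ã" is not a word character.
const isWordChar = /\w/.test('ã');

// Because "ã" is a non-word character, \b also matches between "ã" and "o",
// so the "o" gets upper-cased along with the real word starts.
const capitalized = 'rua são luiz'.replace(original, (match) => match.toUpperCase());

console.log(isWordChar);  // false
console.log(capitalized); // "Rua SãO Luiz"
```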

May I suggest another approach without regular expressions?

'rua são luiz'
    .split(' ')
    .map(word => word.charAt(0).toUpperCase() + word.slice(1))
    .join(' ')

This is just my personal opinion: Regular expressions can quickly become very hard to understand and debug. Sometimes a little more verbose code might be the more maintainable solution.

Manuel
  • "XXX can quickly become very hard to understand and debug." is often heard from someone new to *any* computer language. While it is true that there are many situations where a complex regular expression could be used but other approaches are preferable, in this case a relatively simple regular expression can be employed. It is not helpful to guess about the difficulty of approaches you are not familiar with. – Cary Swoveland Apr 15 '20 at 19:31
  • @CarySwoveland I think it's unfair to suggest that Manuel is new to or unfamiliar with regular expressions. One can say, objectively, that complex regular expressions are harder to understand and harder to debug than code. There are a few reasons: 1. debugging: no tools to "debug" regular expressions (while there are tools which can help by highlighting matching sections, there is no way to "step through" a regular expression engine the way one does with code). Regular expression engines are stateful but we have no window into that state. 2. No way to comment parts of regular expressions. – Codebling Apr 16 '20 at 01:13
  • It's like looking at minified code. Can you understand it if you try? Yes. But it is never going to be as readable as the full source. – Codebling Apr 16 '20 at 01:16
  • @Codebling, Manuel did not say that a regular expression should not be used here because it would have to be unreasonably complex (which would have been incorrect in view of the relatively simple regex I gave in a comment above), but suggested that regular expressions should be avoided generally because they can become complex. I cannot accept that, considering their utility. Note that comments can be included in regular expressions by defining them in [free-spacing mode](https://www.regular-expressions.info/freespacing.html). – Cary Swoveland Apr 16 '20 at 01:42
  • @CarySwoveland The regular expression you gave does not work in Javascript. Even the Demo you linked yields a pattern error. If you know how to solve this with a simple regular expression in Javascript (and not POSIX or whatever that is), please tell us. I am always interested to learn more. – Manuel Apr 16 '20 at 05:56
  • Manuel, I had selected the "EcmaScript (Javascript)" engine at the link I gave in my comment on the question (click on the three horizontal bars to the left of "regular expressions 101"). I don't see a pattern error. The regex matches the first letter of each word in `"rua oãs luiz"` and saves it to capture group 1. There appears to be a problem, however. If I change the second word to `"ãos"` the `"o"` is matched, not the `"ã"`. Perhaps the language needs to be specified. Perhaps you or another reader can explain that. I don't know Javascript, by the way. – Cary Swoveland Apr 16 '20 at 06:31
  • Manuel, `(?<![a-záàâãéèêíïóôõöúçñ])([a-záàâãéèêíïóôõöúçñ])` seems to work. [Demo](https://regex101.com/r/vn9Evu/3/). – Cary Swoveland Apr 16 '20 at 06:40
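For readers who want to try the lookbehind idea from the comments in JavaScript, here is a rough sketch. It swaps the explicit character list for Unicode property escapes (`\p{L}` for any letter, `\p{Ll}` for a lower-case letter), which need the `u` flag and, like lookbehind, an ES2018-capable engine. The helper name `capitalizeWords` is hypothetical, and the NFC normalization step is an added assumption so that accented letters are single code points.

```javascript
// Hypothetical helper, not from the thread: upper-cases the first letter of each word.
function capitalizeWords(text) {
  return text
    // Assumption: compose accents so that "ã" is one code point, not "a" + combining tilde.
    .normalize('NFC')
    // (?<!\p{L}) : the match must not be preceded by a letter (negative lookbehind).
    // \p{Ll}     : the match itself is a single lower-case letter, accented or not.
    .replace(/(?<!\p{L})\p{Ll}/gu, (letter) => letter.toUpperCase());
}

console.log(capitalizeWords('rua são luiz')); // "Rua São Luiz"
console.log(capitalizeWords('ãos'));          // "Ãos"
```

Unlike the explicit list, the property escapes also cover letters outside the Portuguese set, which may or may not be what you want.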