Let's say I have the following string in JavaScript:

&a.b.c. &a.b.c& .&a.b.c.&. *;a.b.c&*. a.b&.c& .&a.b.&&dc.& &ê.b..c&

I want to remove all leading and trailing special characters (anything that is not alphanumeric or a letter from another language's alphabet) from every word.

So the string should look like

a.b.c a.b.c a.b.c a.b.c a.b&.c a.b.&&dc ê.b..c

Notice that the special characters between the alphanumeric characters are left in place. The ê in the last word is also kept.

3 Answers

3

This regex should do what you want. It looks for

  • start of line, or some spaces (^| +) captured in group 1
  • some number of symbol characters [!-\/:-@\[-`\{-~]*
  • a minimal number of non-space characters ([^ ]*?) captured in group 2
  • some number of symbol characters [!-\/:-@\[-`\{-~]*
  • followed by a space or end-of-line (using a positive lookahead) (?=\s|$)

Matches are replaced with just groups 1 and 2 (the spacing and the characters between the symbols).

let str = '&a.b.c. &a.b.c& .&a.b.c.&. *;a.b.c&*. a.b&.c& .&a.b.&&dc.& &ê.b..c&';
str = str.replace(/(^| +)[!-\/:-@\[-`\{-~]*([^ ]*?)[!-\/:-@\[-`\{-~]*(?=\s|$)/gi, '$1$2');
console.log(str);
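
For readability, the same pattern can be assembled from a named punctuation class. This is only a sketch: PUNCT, EDGE_PUNCT and stripEdgePunctuation are names made up for illustration, and the regex they build is meant to be the same as the one above.

```javascript
// Sketch only: PUNCT, EDGE_PUNCT and stripEdgePunctuation are illustrative names.
// ASCII punctuation ranges: ! to /, : to @, [ to `, { to ~
const PUNCT = '[!-\\/:-@\\[-`\\{-~]';

const EDGE_PUNCT = new RegExp(
  '(^| +)' +      // group 1: start of the string or a run of spaces
  PUNCT + '*' +   // leading punctuation (dropped)
  '([^ ]*?)' +    // group 2: the shortest possible run of non-space characters
  PUNCT + '*' +   // trailing punctuation (dropped)
  '(?=\\s|$)',    // must be followed by whitespace or the end of the string
  'gi'
);

function stripEdgePunctuation(text) {
  // Keep only the spacing (group 1) and the word core (group 2).
  return text.replace(EDGE_PUNCT, '$1$2');
}

console.log(stripEdgePunctuation('&a.b.c. &a.b.c& .&a.b.c.&.')); // a.b.c a.b.c a.b.c
```

Note that PUNCT only covers ASCII punctuation, so symbols outside that range are treated as part of the word.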

Note that if you want to preserve a run of punctuation characters that stands on its own (e.g. the & in Apple & Sauce), you should change the second capture group to require one or more non-space characters (([^ ]+?)) instead of zero or more. You should also add a negative lookahead after the initial run of punctuation characters to assert that the next character is not punctuation:

let str = 'Apple &&& Sauce; -This + !That!';
str = str.replace(/(^| +)[!-\/:-@\[-`\{-~]*(?![!-\/:-@\[-`\{-~])([^ ]+?)[!-\/:-@\[-`\{-~]*(?=\s|$)/gi, '$1$2');
console.log(str);
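
To see the difference between the two patterns, it may help to run both on the same input. A small sketch, where stripAll and keepLonePunctuation are illustrative names and the regexes are copied unchanged from the snippets above:

```javascript
// Sketch only: the helper names are illustrative; the regexes come from the answer.
const stripAll = s =>
  s.replace(/(^| +)[!-\/:-@\[-`\{-~]*([^ ]*?)[!-\/:-@\[-`\{-~]*(?=\s|$)/gi, '$1$2');

const keepLonePunctuation = s =>
  s.replace(/(^| +)[!-\/:-@\[-`\{-~]*(?![!-\/:-@\[-`\{-~])([^ ]+?)[!-\/:-@\[-`\{-~]*(?=\s|$)/gi, '$1$2');

// JSON.stringify makes the surviving spaces visible.
console.log(JSON.stringify(stripAll('This + That')));            // "This  That"
console.log(JSON.stringify(keepLonePunctuation('This + That'))); // "This + That"
```

The first pattern removes the lone + but keeps the spaces on both sides of it, which is why two spaces remain.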
Nick
  • This seems to be the best one liner way and it works perfectly for diacritics too. Thanks! – sudoExclaimationExclaimation Oct 20 '19 at 02:00
  • This removes characters from the middle of the string. It should only remove from end and beginning. Test with string like "Apple & Sauce", or "This + That" – siefix May 01 '21 at 21:27
  • @siefix *thank you* for leaving a comment to go with (what I presume is) your downvote. I believe the behaviour you describe is what OP asked for: "I want to remove all the leading and trailing special characters from **all** the words", not just the end and beginning of the string. Now we could argue as to whether `&` or `+` on their own should be stripped; they don't match the pattern of data in the question, but I think it's a reasonable interpretation of the wording of the question. Regardless, I've updated the answer with a regex that will not remove a string of punctuation on its own. – Nick May 02 '21 at 00:00
  • @Nick thanks for the update and additional solution. Yes I interpreted it as each was a separate distinct word, and the OP wanting to keep special characters in the middle (e.g. "a.b.&&dc"). Can see both ways. – siefix May 06 '21 at 17:39
1

a-zA-Z\u00C0-\u017F is used to match the valid characters: ASCII letters plus the accented Latin letters in the range U+00C0 to U+017F. Digits are not part of this class, so add 0-9 if numbers should count as valid.

The following is a single regular expression that captures each individual word. The capture group begins at the first valid character and ends just before the final run of invalid characters that precedes a whitespace character or the end of the string.

const myRegEx = /[^a-zA-Z\u00C0-\u017F]*([a-zA-Z\u00C0-\u017F].*?[a-zA-Z\u00C0-\u017F]*)[^a-zA-Z\u00C0-\u017F]*?(\s|$)/g;  
let myString = '&a.b.c. &a.b.c& .&a.b.c.&. *;a.b.c&*. a.b&.c& .&a.b.&&dc.& &ê.b..c&'.replace(myRegEx, '$1$2');
console.log(myString);
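
If the letters can come from any alphabet, a fixed range such as \u00C0-\u017F will miss some of them. One alternative, which is a different technique from the regex above, is to use Unicode property escapes, available in regexes with the u flag. A minimal sketch, assuming a "valid" character is any Unicode letter (\p{L}) or number (\p{N}); WORD and trimWordEdges are illustrative names:

```javascript
// Sketch only: WORD and trimWordEdges are illustrative names.
// Assumption: "valid" means any Unicode letter (\p{L}) or number (\p{N}).
const WORD = new RegExp(
  '[^\\p{L}\\p{N}\\s]*' +                      // leading invalid characters (dropped)
  '([\\p{L}\\p{N}](?:\\S*[\\p{L}\\p{N}])?)' +  // first valid character through the last one in the word
  '[^\\p{L}\\p{N}\\s]*',                       // trailing invalid characters (dropped)
  'gu'
);

function trimWordEdges(text) {
  // Whitespace is never consumed, so the original spacing is kept.
  return text.replace(WORD, '$1');
}

console.log(trimWordEdges('&a.b.c. a.b&.c& .&a.b.&&dc.& &ê.b..c&')); // a.b.c a.b&.c a.b.&&dc ê.b..c
console.log(trimWordEdges('«Привет!» (мир) #42!'));                  // Привет мир 42
```

Because the pattern requires at least one valid character, a word made up only of punctuation is left unchanged.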
Harvard Pan
0

Something like this might help:

const string = '&a.b.c. &a.b.c& .&a.b.c.&. *;a.b.c&*. a.b&.c& .&a.b.&&dc.& &ê.b..c&';
const result = string.split(' ').map(s => /^[^a-zA-Z0-9ê]*([\w\W]*?)[^a-zA-Z0-9ê]*$/g.exec(s)[1]).join(' ');
console.log(result);

Note that this is not a single regex; it relies on some helper JavaScript code.

Rough explanation: We first split the string into an array of substrings at the spaces. We then transform each substring by stripping its leading and trailing special characters.

The special characters are matched with [^a-zA-Z0-9ê]*. Because of the leading ^ inside the brackets, the class matches every character except those listed, i.e. all special characters. Between these two classes we capture the relevant characters with ([\w\W]*?). \w matches word characters and \W matches non-word characters, so [\w\W] matches any character at all. Appending ? to * makes the quantifier lazy, so the group stops as early as possible and leaves any trailing special characters to the class that follows it. The regex starts with ^ and ends with $, which anchor it to the start and the end of the substring, so it has to match the entire substring.

With .exec(s)[1] we execute the regex on the substring and return the first capture group from the transform function. Note that this group is an empty string if a substring contains no valid characters, since every part of the pattern is allowed to match nothing. At the end we join the substrings with spaces.
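
The steps above can be sketched as separate functions. The names stripWord, stripWords and the keepLonePunctuation option are assumptions added for illustration; the option is not part of the original one-liner and only shows one way to handle substrings that contain no valid characters.

```javascript
// Sketch only: EDGES, stripWord, stripWords and keepLonePunctuation are illustrative names.
const EDGES = /^[^a-zA-Z0-9ê]*([\w\W]*?)[^a-zA-Z0-9ê]*$/;

function stripWord(word, keepLonePunctuation = false) {
  // Every part of the pattern may match nothing, so exec always finds a match here.
  const core = EDGES.exec(word)[1];
  // Assumption: when requested, a word with no valid characters is returned unchanged.
  return core === '' && keepLonePunctuation ? word : core;
}

function stripWords(text, keepLonePunctuation = false) {
  return text
    .split(' ')                                        // 1. split at spaces
    .map(word => stripWord(word, keepLonePunctuation)) // 2. strip each substring
    .join(' ');                                        // 3. join with spaces again
}

// JSON.stringify makes the surviving spaces visible.
console.log(JSON.stringify(stripWords('This + That')));       // "This  That"
console.log(JSON.stringify(stripWords('This + That', true))); // "This + That"
```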

Lukas Bach
  • This doesn't leave behind the last `ê` – Nick Oct 20 '19 at 01:31
  • Yes, that's what I explained in the post. Anyway, I've edited it to also leave behind `ê` characters. You can edit the special-character classes to specify which characters you consider special. – Lukas Bach Oct 20 '19 at 01:51
  • This works pretty well. Only problem is that the `ê` is variable. Like it could have any of those types of alphabets. `à` for example. – sudoExclaimationExclaimation Oct 20 '19 at 01:53
  • Well, you could list all possible special characters explicitly. For universal matching of those letters, you would need to match ranges of Unicode code points. See e.g. https://stackoverflow.com/a/280762/2692307 on how to match Unicode ranges. – Lukas Bach Oct 20 '19 at 02:08
  • This is incorrect. It removes special characters from the middle of the string. Test with a simple string like "This + That" – siefix May 01 '21 at 21:28