This is my regex so far, which splits on non-alphanumeric characters, including international characters (i.e. Korean, Japanese, Chinese characters).

title = '[MV] SUNMI(선미) _ 누아르(Noir)'
title.split(/[^a-zA-Z0-9 ']/)

This is the regex to match any international character:

[^\x00-\x7F]+

Which I got from: Regular expression to match non-English characters? Let's assume this is 100% correct (no debating!)
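As a quick check of what that pattern picks out, here is a minimal Ruby sketch that applies it to the sample title with scan. The variable name non_ascii_runs is only illustrative and is not part of the original question.

```ruby
# Sample title taken from the question.
title = '[MV] SUNMI(선미) _ 누아르(Noir)'

# Collect every run of characters outside the ASCII range (0x00-0x7F).
# non_ascii_runs is an illustrative name, not something from the question.
non_ascii_runs = title.scan(/[^\x00-\x7F]+/)

p non_ascii_runs
```

The two Korean words come back as separate runs, which is why splitting on this pattern would remove them instead of keeping them.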

How do I combine these two so that I split on non-alphanumeric characters but not on international characters? The easy part is done; I just need to combine these regexes somehow.

My expected output would be something like this

["MV", "SUNMI", "선미", "누아르", "Noir"]

TL;DR: I want to split on non-alphanumeric characters only; English letters and foreign characters should not be split on.

Henley

2 Answers

1

(?:[^a-zA-Z0-9](?<![^\x00-\x7F]))+

https://regex101.com/r/EDyluc/1

Whatever the regex does not match (what remains after the split) is what you want to keep.

Explained:

 (?:
      [^a-zA-Z0-9]                  # Not an ASCII alphanumeric
      (?<! [^\x00-\x7F] )           # Looking behind: the character just consumed is not non-ASCII, i.e. it is ASCII
 )+

Let me know if you need a more detailed explanation.
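To see the pattern in use, here is a minimal Ruby sketch that applies it to the title from the question. The names separator and words are illustrative, and the reject call is an addition that assumes the empty leading field is unwanted (the title starts with a separator, so split returns an empty first element).

```ruby
# Sample title taken from the question.
title = '[MV] SUNMI(선미) _ 누아르(Noir)'

# One or more characters that are not ASCII letters or digits, where the
# lookbehind rejects any just-consumed character that is non-ASCII.
separator = /(?:[^a-zA-Z0-9](?<![^\x00-\x7F]))+/

# The title begins with "[", so split yields an empty first field; drop it.
words = title.split(separator).reject(&:empty?)

p words
```

Note that, unlike the regex in the question, this pattern also treats spaces and apostrophes as separators.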

0

So basically you want to split on every ASCII character that is not a letter. You can use this regex, which matches all non-letter characters in the printable ASCII range:

[ -@[-`{-~]+

The character class is made of three ranges from the ASCII table: space to @ (everything before the uppercase letters, which also covers the digits 0-9), [ to backtick (the characters between the uppercase and lowercase letters), and { to ~ (the characters after the lowercase letters).
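Because the first range runs from space to @, it also contains the digits, so this class splits on numbers too. The sketch below compares it with a hypothetical variant (not part of this answer) that leaves 0-9 out by breaking the first range into space to / and : to @. The sample string is made up for illustration.

```ruby
# Class from the answer: every printable ASCII character that is not a letter.
non_letters = /[ -@\[-`{-~]+/

# Hypothetical variant: the first range is split into " ".."/" and ":".."@",
# so the digits 0-9 are no longer treated as separators.
non_alphanumerics = /[ -\/:-@\[-`{-~]+/

sample = 'SUNMI(선미) 2018' # made-up input for illustration

with_digit_split = sample.split(non_letters)
digits_kept = sample.split(non_alphanumerics)

p with_digit_split
p digits_kept
```

With the original class the trailing year disappears from the result, while the variant keeps it as its own element.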

If you also want to cover the extended ASCII range, replace ~ in the regex with ÿ and use [ -@[-`{-ÿ]+ instead.


Here is the Ruby code:

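s = '[MV] SUNMI(선미) _ 누아르(Noir)'
# Split on runs of ASCII non-letters; drop the empty first field left by the leading "["
puts s.split(/[ -@\[-`{-~]+/).reject(&:empty?)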

This prints:

MV
SUNMI
선미
누아르
Noir


Pushpesh Kumar Rajwanshi
  • This isn't precisely what I wanted, but I tried this: title.split(/[^\x00-\x7F]|[^a-zA-Z0-9 ]+/) which gives me ["[MV] SUNMI(", "", ") _ ", "", "", "(Noir)"]. I want it to split on non-alphanumeric characters only (English letters and foreign characters should not be split on) – Henley Mar 04 '19 at 16:08
  • @HenleyChiu: Did this work for you? Let me know if you have any trouble using it. – Pushpesh Kumar Rajwanshi Mar 14 '19 at 18:10