
I'm working with a string that contains both English and Chinese characters. I want to single out each English word, each non-English character (e.g. French, Chinese), each digit, and each special character (e.g. "@#$%^&>?") for further manipulation.

So I tried

var nregex = /[^\u0000-\u007F]|[a-z]+|\d|[!@#$%^&*()_+\-=\[\]{};':"\\|,.<>\/?]/ig

It works for most cases, but I'm worried that some special characters or emoji are not included in the list in my code.

Is there an easier way than listing every special character as I did?

zoyb
  • Well, it is because `\S+` matches everything the previous branch matches. Did you try `/[\u00ff-\uffff]|[a-z]+/ig`? – Wiktor Stribiżew May 30 '16 at 10:26
  • @WiktorStribiżew Thanks! But this fails to include the numeric and other special characters such as 1,2,3,&,#,@,* etc. – zoyb May 30 '16 at 10:35
  • Please provide some more sample inputs with exact expected output, as the question has become much less clear. Do you want to match `iloveyou` as 3 different words? – Wiktor Stribiżew May 30 '16 at 10:41
  • Possible duplicate of [How to do word counts for a mixture of English and Chinese in Javascript](http://stackoverflow.com/questions/20396456/how-to-do-word-counts-for-a-mixture-of-english-and-chinese-in-javascript) – Pedro Lobito May 30 '16 at 10:42
  • @PedroLobito: It looks like [that solution](http://stackoverflow.com/questions/20396456/how-to-do-word-counts-for-a-mixture-of-english-and-chinese-in-javascript) does not work for OP, see the above comment. – Wiktor Stribiżew May 30 '16 at 10:43
  • http://stackoverflow.com/a/32961117/797495 – Pedro Lobito May 30 '16 at 10:45
  • I suspect OP wants to read `iloveyou` as `i`, `love`, `you`, and thus some NLP package is required. – Wiktor Stribiżew May 30 '16 at 10:55
  • @WiktorStribiżew Sorry for the late response, I was working on the code. I don't need to cut "iloveyou" into three pieces, but my question has expanded to include all special characters and non-English characters. So I tried `var nregex = /[^\u0000-\u007F]|[a-z]+|\d|[!@#$%^&*()_+\-=\[\]{};':"\\|,.<>\/?]/ig` to do the work, but I'm still worried that a user might type a special character or emoji that is not in the special character list. – zoyb May 30 '16 at 11:07
  • Please update the question. I guess html tag does not have anything to do with it, add emoji if you care about them. – Wiktor Stribiżew May 30 '16 at 11:12
  • I've posted an answer that may help you. It's not the perfect solution but will definitely work for your example. – Pedro Lobito May 30 '16 at 11:23
  • Why did you change your answer?! It's completely different from the original. – Pedro Lobito May 30 '16 at 11:27
  • @WiktorStribiżew I just modified the question, but I don't know if it's OK to change it so far from the original, since some people may already have engaged with the original question. – zoyb May 30 '16 at 11:28
  • Look, your original question was far from being clear. Now, it is still far from being clear. You should provide an [MCVE](http://stackoverflow.com/help/mcve). The regex you showed implies getting all the `[^\u0000-\u007F]` symbols, `[a-z]+` chunks, `\d` digits and then anything but these 3 (that is how I read it). So, you could try `[^\u0000-\u007F]|[a-z]+|\d|(?![^\u0000-\u007F]|[a-z\d]).`. – Wiktor Stribiżew May 30 '16 at 11:41
  • Well, what about [`/[^\u0000-\u007F]|[\u0000-\u0008\u0011\u0012\u0014-\u0019\u0021-\u007F]+/g`](https://regex101.com/r/qN9jI7/2)? – Wiktor Stribiżew May 30 '16 at 12:53
  • It works perfectly with English, other languages and special characters. However, numbers are not selected by the code you helped with. How can I modify it to include numbers? For example, `abc123` would output ["abc",1,2,3], `1517` outputs [1,5,1,7], `123abc` outputs [1,2,3,"abc"], `我爱你123` outputs [我,爱,你,1,2,3], and `123我爱你` outputs [1,2,3,我,爱,你]. – zoyb May 30 '16 at 15:51
  • Where can I learn how Unicode maps to regex? https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/RegExp couldn't help. – zoyb May 30 '16 at 15:54

1 Answer

Not a perfect solution, and you may need to tweak it, but it will work for the example given:

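// Example input mixing English words and Chinese characters
const string2 = "I love you 我爱你";
// Keep only ASCII letters and spaces, then split into words
const englishChars = string2.replace(/[^a-z ]/ig, "").trim().split(/\s+/);
// Drop ASCII letters and spaces, then split what is left into single characters
const nonEnglishChars = string2.replace(/[a-z ]/ig, "").split(/[ ]*/);
const result = englishChars.concat(nonEnglishChars);
console.log(result);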
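
For the expanded requirement in the comments (digits one by one, plus any special character or emoji without listing them), a single `match` with a catch-all last branch might be enough. This is only a sketch under assumptions: `tokenize` is a hypothetical helper name, it keeps tokens in their original order, whitespace is simply dropped, and digits come back as strings rather than numbers.

```javascript
// Hypothetical helper, not part of the original answer.
function tokenize(text) {
  // [A-Za-z]+ : a run of ASCII letters becomes one token
  // \d        : each ASCII digit is its own token
  // \S        : any other single non-whitespace character
  //             (CJK, accented letters, punctuation, symbols)
  // The u flag makes \S consume a whole code point, so an emoji
  // outside the BMP is not split into two surrogate halves.
  return text.match(/[A-Za-z]+|\d|\S/gu) || [];
}

console.log(tokenize("I love you 我爱你")); // ["I", "love", "you", "我", "爱", "你"]
console.log(tokenize("abc123 #@"));         // ["abc", "1", "2", "3", "#", "@"]
console.log(tokenize("hi😀"));              // ["hi", "😀"]
```

Two caveats: accented words are split at the accented letter (`café` gives `caf` and `é`), and emoji built from several code points (for example with a zero-width joiner or a variation selector) still come out as several tokens.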
Pedro Lobito