I know JavaScript regular expressions have native lookaheads but not lookbehinds.

I want to split a string at points either beginning with any member of one set of characters or ending with any member of another set of characters.

Split before ໃ, ໄ, ໂ, ເ, ແ. Split after ະ.

In: ເລື້ອຍໆມະຫັດສະຈັນເອກອັກຄະລັດຖະທູດ

Out: ເລື້ອຍໆມະ ຫັດສະ ຈັນ ເອກອັກຄະ ລັດຖະ ທູດ

I can achieve the "split before" part using zero-width lookahead:

'ເລື້ອຍໆມະຫັດສະຈັນເອກອັກຄະລັດຖະທູດ'.split(/(?=[ໃໄໂເແ])/)

["ເລື້ອຍໆມະຫັດສະຈັນ", "ເອກອັກຄະລັດຖະທູດ"]

But I can't think of a general approach to simulating zero-width lookbehind.

I'm splitting strings of arbitrary Unicode text so don't want to substitute in special markers in a first pass, since I can't guarantee the absence of any string from my input.

hippietrail

3 Answers

Instead of splitting, you may consider using the match() method.

var s = 'ເລື້ອຍໆມະຫັດສະຈັນເອກອັກຄະລັດຖະທູດ',
    r = s.match(/(?:(?!ະ).)+?(?:ະ|(?=[ໃໄໂເແ]|$))/g);

console.log(r); //=> [ 'ເລື້ອຍໆມະ', 'ຫັດສະ', 'ຈັນ', 'ເອກອັກຄະ', 'ລັດຖະ', 'ທູດ' ]
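
If the two character sets need to vary, the same match-based pattern can be built at runtime. This is only a sketch: the helper name `splitBeforeAfter` and its parameters are hypothetical, and it assumes both sets are plain strings of literal characters that need no escaping inside a character class. Like the pattern above, it silently skips an "after" character that has nothing in front of it.

```javascript
// Hypothetical helper, not part of the answer above: builds the same
// match-based pattern from two sets of characters.
function splitBeforeAfter(text, beforeChars, afterChars) {
  // Assumption: neither set contains regex-special characters
  // such as ] \ ^ or -, so they can go straight into a character class.
  const before = `[${beforeChars}]`;
  const after = `[${afterChars}]`;
  const pattern = new RegExp(
    `(?:(?!${after}).)+?(?:${after}|(?=${before}|$))`,
    'g'
  );
  // match() returns null when nothing matches, so fall back to [].
  return text.match(pattern) || [];
}

const laoPieces = splitBeforeAfter(
  'ເລື້ອຍໆມະຫັດສະຈັນເອກອັກຄະລັດຖະທູດ',
  'ໃໄໂເແ',
  'ະ'
);
console.log(laoPieces.join(' '));
// ເລື້ອຍໆມະ ຫັດສະ ຈັນ ເອກອັກຄະ ລັດຖະ ທູດ
```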
hwnd

If you use capturing parentheses in the delimiter regex, the captured text is included in the returned array. So you can just split on /(ະ)/ and then append each odd-indexed member of the resulting array to the even-indexed member that precedes it. Example:

"ເລື້ອຍໆມະຫັດສະຈັນເອກອັກຄະລັດຖະທູ".split(/(ະ)/).reduce(function(arr,str,index) {
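  if (index % 2 === 0) {
    // Even index: text between delimiters starts a new piece.
    arr.push(str);
  } else {
    // Odd index: the captured ະ is appended to the preceding piece.
    arr[arr.length - 1] += str;
  }
  return arr;
}, [])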

Result: ["ເລື້ອຍໆມະ", "ຫັດສະ", "ຈັນເອກອັກຄະ", "ລັດຖະ", "ທູ"]

You can do another pass to split on the lookahead:

"ເລື້ອຍໆມະຫັດສະຈັນເອກອັກຄະລັດຖະທູ".split(/(ະ)/).reduce(function(arr,str,index) {
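  if (index % 2 === 0) {
    // Even index: text between delimiters starts a new piece.
    arr.push(str);
  } else {
    // Odd index: the captured ະ is appended to the preceding piece.
    arr[arr.length - 1] += str;
  }
  return arr;
}, []).reduce(function (arr, str) {
  // Second pass: split each piece before any of the leading vowels.
  return arr.concat(str.split(/(?=[ໃໄໂເແ])/));
}, []);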

Result: ["ເລື້ອຍໆມະ", "ຫັດສະ", "ຈັນ", "ເອກອັກຄະ", "ລັດຖະ", "ທູ"]
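
To see why the odd-indexed members are the delimiters, and how the two passes fit together, here is a small ASCII sketch. The helper name `splitAfterThenBefore` and its parameters are made up for illustration, and it assumes the "after" regex contains exactly one capturing group. Unlike the code above, it also drops the empty trailing piece that appears when the input ends with a delimiter.

```javascript
// With a capturing group, split() keeps the delimiters at the odd
// indexes of the result.
const parts = 'a-b-c'.split(/(-)/);
console.log(parts); // [ 'a', '-', 'b', '-', 'c' ]

// Hypothetical helper: glue each delimiter onto the piece before it,
// then split before anything matched by the lookahead regex.
// Assumption: afterRe has exactly one capturing group.
function splitAfterThenBefore(text, afterRe, beforeRe) {
  const glued = [];
  text.split(afterRe).forEach((piece, index) => {
    if (index % 2 === 0) {
      glued.push(piece);
    } else {
      glued[glued.length - 1] += piece;
    }
  });
  return glued
    .flatMap((piece) => piece.split(beforeRe))
    .filter((piece) => piece !== ''); // drop empty pieces
}

console.log(splitAfterThenBefore('ab-cd+ef-', /(-)/, /(?=\+)/));
// [ 'ab-', 'cd', '+ef-' ]
```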

Mark Reed
  • In a first pass before then doing a pass on the lookahead part? That's what I'm playing with right now (-: ... – hippietrail Aug 29 '14 at 02:58
  • There's one way in which this solution isn't general. If the "end" pattern can be a varying number of characters. This doesn't happen in my current iteration but may do so in the future, and more general solutions are more betterer (-: ... Then again I did specify "character" in my question. – hippietrail Aug 29 '14 at 03:07
  • I don't see how that would matter. The split pattern could just as easily be an alternation ... whatever the actual delimiter is in each case, it will still be included and appended to the previous string. – Mark Reed Aug 29 '14 at 03:09
  • You're right. I thought I saw something hard-coded on the length of ະ but not. – hippietrail Aug 29 '14 at 03:15
You could try matching rather than splitting:

> var re = /((?:(?!ະ).)+(?:ະ|$))/g;
undefined
> var str = "ເລື້ອຍໆມະຫັດສະຈັນເອກອັກຄະລັດຖະທູດ"
undefined
> var m;
undefined
> while ((m = re.exec(str)) != null) {
... console.log(m[1]);
... }
ເລື້ອຍໆມະ
ຫັດສະ
ຈັນເອກອັກຄະ
ລັດຖະ
ທູດ

Then split each element of the resulting array again, this time using the lookahead.
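
That second pass might look something like the sketch below. The variable names are illustrative, the capturing group from the pattern above is dropped because match() with the g flag returns whole matches anyway, and flatMap assumes a reasonably recent JavaScript engine.

```javascript
// Same pattern as above, without the capturing group.
const re = /(?:(?!ະ).)+(?:ະ|$)/g;
const str = 'ເລື້ອຍໆມະຫັດສະຈັນເອກອັກຄະລັດຖະທູດ';

// First pass: chunks that end with ະ or at the end of the input.
const chunks = str.match(re) || [];

// Second pass: split each chunk before any of the leading vowels.
const result = chunks.flatMap((chunk) => chunk.split(/(?=[ໃໄໂເແ])/));

console.log(result.join(' '));
// ເລື້ອຍໆມະ ຫັດສະ ຈັນ ເອກອັກຄະ ລັດຖະ ທູດ
```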

Avinash Raj