
If a character such as '這' is non-English, it takes up 2 units of width, while an English character takes up 1. The maximum length is 10 units. How can I get the first 10 units of a string?
For the example below, how do I get the result This這 is?
I'm trying to loop from the first character with a for loop, but I don't know how to get each character of the string...

string = "This這 is是 English中文 …";

var NonEnglish = "[^\u0000-\u0080]+",
    Pattern = new RegExp(NonEnglish),
    MaxLength = 10,
    Ratio = 2;
user1775888

2 Answers


If you mean you want the part of the string up to the point where its length reaches 10, here's the answer:

var string = "This這 is是 English中文 …";

function check(string){
  // Characters in \u0000-\u0080 count as 1; the other characters, which the OP wants, count as 2
  var length = 0, i = 0, len = string.length;

  // you can iterate over strings just like arrays
  for(; i < len; i++){

    // if the character is what the OP wants (outside \u0000-\u0080), it costs 2, else 1
    var width = /[^\u0000-\u0080]/.test(string[i]) ? 2 : 1;

    // if adding this character would push the length past 10, come out of the loop
    if(length + width > 10) break;
    length += width;
  }

  // return the string from the first character up to the index where we left the for loop
  return string.substr(0, i);
}

alert(check(string));


EDIT 1:

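  1. Replaced .match with .test. The former returns a whole array, while the latter simply returns true or false.
  2. Simplified the regex. Since only one character is tested at a time, the + quantifier is not needed. The ^ inside the character class has to stay, because it is what negates the \u0000-\u0080 range.
  3. Replaced string.length in the loop condition with the cached len, so the length is not read again on every iteration.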
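The points in the list above can be checked with a small sketch. The sample characters are chosen only for illustration, and the variable names are not part of the original answer.

```javascript
// Negated character class: matches any single character outside \u0000-\u0080.
var nonAscii = /[^\u0000-\u0080]/;

// .test returns a plain boolean.
var isWide = nonAscii.test("這");    // true
var isNarrow = nonAscii.test("T");   // false

// .match returns an array of match details, or null when nothing matches.
var matched = "這".match(nonAscii);  // array whose first element is "這"
var unmatched = "T".match(nonAscii); // null

// Without the brackets the pattern is the literal sequence
// "\u0000", "-", "\u0080", so it does not match ordinary text.
var literal = /\u0000-\u0080/.test("這"); // false

console.log(isWide, isNarrow, matched[0], unmatched, literal);
```

For a yes/no check on a single character, .test avoids building the array that .match would return.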
Community
HighBoots
  • Is it possible to use the variable `i` outside the scope of the for loop? – Mr_Green Feb 27 '14 at 05:32
  • Just be careful, as this can take a long time to process because you run a regex on each character of the string – fedmich Feb 27 '14 at 05:54
  • @user1775888 I agree with fedmich. [Here's a video which shows how](https://www.facebook.com/photo.php?v=709991675708445). And see my answer edit, I added some things which might increase the speed significantly with larger strings. – HighBoots Feb 27 '14 at 09:56
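One way to avoid running a regex on every character, as raised in the comments, is to compare character codes directly. This is a minimal sketch and has not been benchmarked. The function name `truncateByWidth` and its `maxWidth` and `wideCost` parameters are hypothetical, and it assumes the question's rule that anything above \u0080 costs 2.

```javascript
// Hypothetical helper: returns the longest prefix of `text` whose weighted
// width does not exceed `maxWidth`. Code units above 0x80 cost `wideCost`.
function truncateByWidth(text, maxWidth, wideCost) {
  var width = 0;
  var i = 0;
  for (; i < text.length; i++) {
    // charCodeAt gives a number, so this is a plain comparison with no regex.
    var cost = text.charCodeAt(i) > 0x80 ? wideCost : 1;
    // Stop before a character that would not fit.
    if (width + cost > maxWidth) break;
    width += cost;
  }
  // `i` is declared with var, so it is still readable after the loop.
  return text.slice(0, i);
}

console.log(truncateByWidth("This這 is是 English中文 …", 10, 2)); // "This這 is"
```

The sketch works on UTF-16 code units, so a character outside the Basic Multilingual Plane would count as two wide units and could be cut in half. It also shows why `i` can be used after the loop: `var` is function-scoped, not block-scoped.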

I'd suggest something along the following lines (assuming that you're trying to break the string up into snippets that are <= 10 bytes in length):

var string = "This這 is是 English中文 …";

function byteCount(text) {
    //get the number of bytes consumed by a string
    return encodeURI(text).split(/%..|./).length - 1;
}

function tokenize(text, targetLen) {
    //break a string up into snippets that are <= to our target length
    var result = [];

    var pos = 0;
    var current = "";
    while (pos < text.length) {
        var next = current + text.charAt(pos);

        if (byteCount(next) > targetLen) {
            result.push(current);
            current = "";
            pos--;
        }
        else if (byteCount(next) == targetLen) {
            result.push(next);
            current = "";
        }
        else {
            current = next;
        }

        pos++;
    }
    if (current != "") {
       result.push(current);
    }

    return result;
};

console.log(tokenize(string, 10));

http://jsfiddle.net/5pc6L/
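Note that byteCount measures UTF-8 bytes, so a character such as '這' counts as 3 there, not the 2 the question asks for. If the question's weights are what is wanted, the same splitting idea could be sketched as follows. The names `charWidth` and `splitByWidth` are hypothetical, and the sketch assumes a maximum width of at least 2 so that every character fits in a snippet.

```javascript
// Hypothetical helper: 2 for code units above \u0080, 1 otherwise
// (the weights from the question, not UTF-8 byte counts).
function charWidth(ch) {
  return ch.charCodeAt(0) > 0x80 ? 2 : 1;
}

// Hypothetical helper: split `text` into snippets whose weighted width
// is at most `maxWidth`. Assumes maxWidth >= 2 so every character fits.
function splitByWidth(text, maxWidth) {
  var result = [];
  var current = "";
  var width = 0;
  for (var i = 0; i < text.length; i++) {
    var w = charWidth(text.charAt(i));
    // Start a new snippet when the next character would not fit.
    if (width + w > maxWidth) {
      result.push(current);
      current = "";
      width = 0;
    }
    current += text.charAt(i);
    width += w;
  }
  // Keep whatever is left over as the final snippet.
  if (current !== "") result.push(current);
  return result;
}

console.log(splitByWidth("This這 is是 English中文 …", 10));
// ["This這 is", "是 English", "中文 …"]
```

Tracking a running width avoids re-measuring the whole snippet on every character, which the encodeURI-based approach does.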

aroth