I would like to split a large string at commas or semicolons into chunks of at most n characters.

This similar question is very close to my situation, but what I really want is to split at a comma or semicolon, subject to an n_max_size limit.

My situation: I am using a text-to-speech service to convert text to voice. The service provider limits each request to 100 words, so I have to split an article into several substrings. If I just split it into fixed-size pieces, the pauses and intonation of the voice do not sound like a human's.

What would be the best way in terms of performance to do this?

Allen
  • What is the processing you will do on each chunk? If that processing is less than trivial, then the performance of the split will probably not be that important. Anyway, what is your current attempt (code), and what is not working? – trincot Jul 25 '19 at 12:25
  • Solve the problem first, optimize later. – Frieder Jul 25 '19 at 12:34
  • @trincot I just updated my question. – Allen Jul 25 '19 at 12:34
  • @Frieder Yeah, that's true. I've already solved the problem, but now I want to do better. :) – Allen Jul 25 '19 at 12:35
  • If you have a working solution and have questions about efficiency, then the question is more suitable for [CodeReview](https://codereview.stackexchange.com/). If however you don't have code that works exactly as you really would want it, then share your code and describe the problem it has. – trincot Jul 25 '19 at 14:03

1 Answer

From comments I understand you don't want to split at each comma or semi-colon, but only when the maximum size is about to be reached. Also you want to keep the delimiter (the comma or semi-colon where you split at) in the result.

To add a max-size limit to the regular expression, you can use a pattern like `.{1,100}`, where 100 is that maximum (for example). If your engine does not support the dotAll flag (`s`) yet, use `[^]` instead of `.` so that newline characters are matched as well.
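As a rough illustration of this first step only (a size limit, with no delimiter handling yet), the sketch below cuts a string into pieces of at most `maxSize` characters. The helper names `fixedChunks` and `dotChunks` are made up for this example and are not part of the function given further down.

```javascript
// Hypothetical helper (name invented for this sketch): cut `s` into pieces of
// at most `maxSize` characters, ignoring delimiters entirely.
// `[^]` matches any character, including newlines.
const fixedChunks = (s, maxSize) =>
  s.match(new RegExp("[^]{1," + maxSize + "}", "g")) || [];

// Same idea with a plain `.` and no dotAll flag: newlines are never matched,
// so they silently disappear from the result.
const dotChunks = (s, maxSize) =>
  s.match(new RegExp(".{1," + maxSize + "}", "g")) || [];

console.log(fixedChunks("ab\ncdefg", 3)); // [ 'ab\n', 'cde', 'fg' ]
console.log(dotChunks("ab\ncdefg", 3));   // [ 'ab', 'cde', 'fg' ]
```

The `|| []` is there because `match` returns `null` when nothing matches, for example on an empty string.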

To ensure that the split happens just after a delimiter, append `(.$|[,;])` to the regex and reduce the preceding `{1,100}` to `{1,99}`.
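A minimal sketch of the pattern as it stands after this step, with the limit lowered to 20 to keep the input short. The name `splitAtDelimiter` is invented for this example. It also shows what this intermediate pattern does not handle: when a stretch of text has no delimiter inside the limit, part of it is skipped instead of being returned.

```javascript
// Hypothetical intermediate helper: at most maxSize characters per chunk,
// each chunk ending at a comma/semicolon or at the end of the string.
const splitAtDelimiter = (s, maxSize) =>
  s.match(new RegExp("[^]{1," + (maxSize - 1) + "}(.$|[,;])", "g")) || [];

const text = "hello,this is a longer sentence without commas;but no problem";
const parts = splitAtDelimiter(text, 20);
console.log(parts);

// Every chunk respects the limit ...
console.log(parts.every(p => p.length <= 20)); // true

// ... but the chunks no longer add up to the input: the stretch
// "this is a longer sentence without commas;" has no delimiter within
// 20 characters of its start, so its beginning is skipped.
console.log(parts.join("") === text); // false
```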

Then there is the case where a stretch of 100 or more characters contains no delimiter: the code below then makes an exception and allows a longer chunk, extending it until a delimiter is found. You may want to add white space (`\s`) as a possible delimiter too.

Here is a function that takes the size as argument and creates the corresponding regex:

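// Prefer the longest chunk of at most maxSize characters that ends at a delimiter or at the end of the string;
// if there is none, fall back to a longer chunk that runs up to the next delimiter.
// `|| []` guards against match() returning null when nothing matches (e.g. an empty string).
const mySplit = (s, maxSize=s.length) => s.match(new RegExp("(?=\\S)([^]{1," + (maxSize-1) + "}|[^,;]*)(.$|[,;])", "g")) || [];

console.log(mySplit("hello,this is a longer sentence without commas;but no problem", 20));
// [ "hello,", "this is a longer sentence without commas;", "but no problem" ]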
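The suggestion above to also treat white space as a delimiter could look roughly like the sketch below. `mySplitWs` is a made-up name, and adding `\s` to both character classes is only one possible reading of that suggestion, not something confirmed in this thread.

```javascript
// Hypothetical variant of mySplit: commas, semicolons and white space all
// count as split points (assumption: `\s` is simply added to both classes).
const mySplitWs = (s, maxSize = s.length) =>
  s.match(new RegExp(
    "(?=\\S)([^]{1," + (maxSize - 1) + "}|[^,;\\s]*)(.$|[,;\\s])", "g"
  )) || [];

const sample = "hello,this is a longer sentence without commas;but no problem";
const chunks = mySplitWs(sample, 20);
console.log(chunks);

// For this input every chunk now fits within the limit ...
console.log(chunks.every(c => c.length <= 20)); // true
// ... and nothing is lost.
console.log(chunks.join("") === sample);        // true
```

With this variant, a chunk that ends at white space keeps that trailing white space, and a split may land on a space even when a comma or semicolon occurs slightly earlier. For text-to-speech input, where punctuation marks the natural pauses, that trade-off may or may not be acceptable.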
trincot
  • Thanks very much. But how can I make each chunk as close as possible to the maxSize? – Allen Jul 25 '19 at 12:50
  • What do you mean with "as close as possible"? Are you suggesting that it could be a chunk that is greater in size than maxSize? This solution will never go beyond maxSize, but will get the greatest chunk *within* that size. I thought that was what the "max" in *"`n_max_size` limit"* meant ;-) – trincot Jul 25 '19 at 13:53
  • This is my test text: '春雪等不来青青杨柳, 留一抹清寒落在枝头; 四月的风足够温柔, 却化不开夜色里浓重的愁; 时光它走啊走, 停在石板路的尽头; 是同岁月纠缠的无止无休,和故事里永远不会少的二两老酒; 大雨把心思都淋透, 假装看不见这世界有多丑; 梦里哼着小曲儿的你容颜依旧温柔, 随天地逍遥游游,慢舸轻舟.'. With `mySplit(text, 100)` I got the result `[ '春雪等不来青青杨柳', '留一抹清寒落在枝头','四月的风足够温柔','却化不开夜色里浓重的愁','时光它走啊走','停在石板路的尽头','是同岁月纠缠的无止无休','和故事里永远不会少的二两老酒','大雨把心思都淋透','假装看不见这世界有多丑','梦里哼着小曲儿的你容颜依旧温柔','随天地逍遥游游,慢舸轻舟.' ]`, but I expect `['春雪等不来青青杨柳, 留一抹清寒落在枝头; 四月的风足够温柔, 却化不开夜色里浓重的愁; 时光它走啊走, 停在石板路的尽头; 是同岁月纠缠的无止无休,和故事里永远不会少的二两老酒; ', '大雨把心思都淋透, 假装看不见这世界有多丑; 梦里哼着小曲儿的你容颜依旧温柔, 随天地逍遥游游,慢舸轻舟.']`: two chunks, the first of which has length 92, as close as possible to the max. – Allen Jul 26 '19 at 03:20
  • I had misunderstood your question. I have updated my answer. – trincot Jul 26 '19 at 07:52