
The following regex (taken from here) splits a string into chunks of at least a given character length (e.g. 20 characters), while being word-aware (live demo):

\b[\w\s]{20,}?(?=\s)|.+$

This means that if a word would be "cut" in the middle (based on the provided character length), the whole word is taken instead:

const str = "this is an input example of one sentence that contains a bit of words and must be split"

const substringMaxLength = 20;

const regex = new RegExp(`\\b[\\w\\s]{${substringMaxLength},}?(?=\\s)|.+$`, 'g');

const substrings = str.match(regex);

console.log(substrings);

However, as can be seen when running the snippet above, the leading whitespace is taken with each substring. Can it be ignored, so that we'll end up with this?

[
  "this is an input example",
  "of one sentence that",
  "contains a bit of words",
  "and must be split"
]

I tried adding either [^\s], (?:\s), (?!\s) everywhere, but just couldn't achieve it.

How can it be done?

OfirD

3 Answers


You can require that every match starts with \w, for both alternatives of your current regex. Since that first character is already consumed, the quantifier uses the length minus 1:

const str = "this is an input example of one sentence that contains a bit of words and must be split"

const substringMaxLength = 20;

const regex = new RegExp(`\\b\\w(?:[\\w\\s]{${substringMaxLength-1},}?(?=\\s)|.*$)`, 'g');

const substrings = str.match(regex);

console.log(substrings);
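As a quick check of this pattern, here is a minimal sketch that wraps it in a helper and runs it on the sample sentence and on a one-character input. The helper name `splitWordAware` and its parameters are hypothetical and not part of the answer.

```javascript
// Hypothetical helper (not from the answer): wraps the pattern shown above.
function splitWordAware(text, minLength) {
  // The first \w is consumed explicitly, so the quantifier needs minLength - 1 more characters.
  const regex = new RegExp(`\\b\\w(?:[\\w\\s]{${minLength - 1},}?(?=\\s)|.*$)`, 'g');
  // match() returns null when nothing matches, so fall back to an empty array.
  return text.match(regex) || [];
}

const sample = "this is an input example of one sentence that contains a bit of words and must be split";

const chunks = splitWordAware(sample, 20);
console.log(chunks);
// [ 'this is an input example', 'of one sentence that',
//   'contains a bit of words', 'and must be split' ]

// A single trailing character is still matched, because `.*$` can match an empty string.
const single = splitWordAware("a", 20);
console.log(single); // [ 'a' ]
```

No chunk starts with whitespace, because a match can only begin where \w matches.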
trincot
  • Seems like an ending `?` is needed to capture a single character. – OfirD Oct 13 '22 at 21:00
  • If it is the last character of the input, it is not needed, as `.*$` can match an empty string, and if it is not the last character of the input, then certainly more should be captured (19 to go...) with the first option. – trincot Oct 13 '22 at 21:05
  • It's not working when the string contains some punctuation or special characters. Could you please also provide the solution for the following string? `this is an input, example of-one sentence. that contains! a bit of words and; must be split.` – Awolad Hossain Dec 12 '22 at 07:05
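For input with punctuation, one possible adaptation (an assumption, not something proposed in this thread) is to start each match at any non-whitespace character with `\S` and let `.` stand in for `[\w\s]`. The sketch below assumes single-line input, since `.` does not match line breaks; the variable names are illustrative.

```javascript
// Assumption: single-line input ('.' does not match line breaks).
const punctuated = "this is an input, example of-one sentence. that contains! a bit of words and; must be split.";

const minChunkLength = 20;

// \S          : every chunk starts at a non-whitespace character (word or punctuation)
// .{n-1,}?    : lazily take at least n-1 more characters of any kind
// (?=\s)      : stop only right before whitespace, so words are not cut
// .*$         : otherwise take the rest of the line
const punctuationRegex = new RegExp(`\\S(?:.{${minChunkLength - 1},}?(?=\\s)|.*$)`, 'g');

const punctuatedChunks = punctuated.match(punctuationRegex);

console.log(punctuatedChunks);
// [ 'this is an input, example', 'of-one sentence. that',
//   'contains! a bit of words', 'and; must be split.' ]
```

Because the pattern no longer relies on `\b` or `\w`, punctuation attached to a word stays with that word.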

This is how you can do it:

const regex = new RegExp(`\\b((?:[^\\s]+\\s?){${substringMaxLength},}?)(?=\\s)|.+$`, 'g');

The regex wraps a non-capturing group (?:[^\s]+\s?) in a capturing group and follows it with a positive lookahead (?=\s). The lookahead requires whitespace right after the match without including that whitespace in the match. Note that {20,} counts repetitions of the non-capturing group rather than characters; since each repetition consumes at least one non-whitespace character, a match contains at least 20 non-whitespace characters. The demo uses a variant with an extra \b before the lookahead: \b((?:[^\s]+\s?){20,}?)\b(?=\s) Regex Demo

Gwhyyy
Sh_gosha

Your pattern can start with a word character, followed by a quantifier that uses the length minus 1.

The negative lookahead (?!\S) asserts a whitespace boundary to the right.

The alternative matches the rest of the line, and also starts with a word character.

\b\w(?:[\w\s]{19,}?(?!\S)|.*)

Regex demo
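To illustrate how the negative lookahead (?!\S) differs from a positive lookahead (?=\s), here is a small sketch. The two test patterns and their names are made up for illustration only.

```javascript
// Illustrative patterns (not from the answer): the same letter followed by each lookahead.
const needsWhitespace = /a(?=\s)/;   // requires a whitespace character to the right
const forbidsNonSpace = /a(?!\S)/;   // only requires that no non-whitespace character follows

const lookaheadResults = {
  positiveBeforeSpace: needsWhitespace.test("a b"), // true: a space follows
  negativeBeforeSpace: forbidsNonSpace.test("a b"), // true: a space follows
  positiveAtEnd: needsWhitespace.test("a"),         // false: nothing follows
  negativeAtEnd: forbidsNonSpace.test("a"),         // true: end of string is accepted
  negativeBeforeLetter: forbidsNonSpace.test("ab"), // false: a non-whitespace character follows
};

console.log(lookaheadResults);
```

In other words, (?!\S) is satisfied both before whitespace and at the end of the string, whereas (?=\s) is satisfied only before whitespace.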

const str = "this is an input example of one sentence that contains a bit of words and must be split"

const substringMaxLength = 20;

const regex = new RegExp(`\\b\\w(?:[\\w\\s]{${substringMaxLength-1},}?(?!\\S)|.*)`, 'g');

const substrings = str.match(regex);

console.log(substrings);
The fourth bird