
I'm trying to split a string based on either spaces or certain symbols (presently *_-<>). I'll give some examples of input and output:

"Hello how are you" -> [ "Hello", " ", "how", " ", "are", " ", "you" ]

"Hello *how* are *you*" -> [ "Hello", " ", "*how*", " ", "are", " ", "*you*" ]

"Hello *how*are_you_" -> [ "Hello", " ", "*how*", "are", "_you_" ]

"*how*are _you_ \*doing*_today_ hm?" -> [ "*how*", "are", " ", "_you_", " ", "\*doing*", "_today_", " ", "hm?" ]

Splitting on space unfortunately turns cases like *how*_are_ into a single item in the array instead of multiple items.

I also tried using a Regex to split on, but unfortunately it doesn't maintain the symbols surrounding each word.

Sorry if this is a bit confusing. Is there a good way to handle this?

Kamil Kiełczewski
Ryan Peschel
  • Possible duplicate of [Javascript and regex: split string and keep the separator](https://stackoverflow.com/questions/12001953/javascript-and-regex-split-string-and-keep-the-separator) – Harun Yilmaz Sep 16 '19 at 07:49
  • Do you mean `"*how*are _you_ \\*doing*_today_ hm?"`, with a real, literal backslash? – Wiktor Stribiżew Sep 16 '19 at 07:50
  • Yeah the literal backslash is intentional. @HarunYilmaz I'll check out that answer and find out if my question differs. This is a new concept to me (lookaheads) so it will take some time for me to figure out whether or not it works here. – Ryan Peschel Sep 16 '19 at 07:51
  • Probably, `s.match(/<[^\s*_<>-]+>|([*_-]?)[^\s*_<>-]+\1|\s+/g)` will help. – Wiktor Stribiżew Sep 16 '19 at 07:56
  • @WiktorStribiżew wow that seems to work with all my inputs. I'm going to spend the next 30 minutes studying that regex to try and understand it. – Ryan Peschel Sep 16 '19 at 08:04
  • @RyanPeschel I do not quite get it: do you need to match `>word>` and `` like things as a single token? – Wiktor Stribiżew Sep 16 '19 at 08:11
  • @WiktorStribiżew Sorry about that, I think I messed up the original post (but your regex is probably enough to get me from 95% of the way there to 100%). Thankfully I don't think that part matters, because I am going to have to re-parse the items again anyway to conditionally turn them into React components (small Markdown subset). So even if it's more general than it needs to be, the second parse should catch it. – Ryan Peschel Sep 16 '19 at 08:18

2 Answers


Rather than using split, one option is to use .match: either match one of the symbols, followed by characters that aren't that symbol, followed by that symbol again, or match non-space, non-symbol characters:

// Put the dash first so it is treated literally inside the character class:
const delims = '-*_<>';

// Construct a pattern like:
// ([-*_<>])(?:(?!\1).)+\1| |[^-*_<> ]+
// String.raw keeps the backslash in \1 intact.
const patternStr = String.raw`([${delims}])(?:(?!\1).)+\1| |[^${delims} ]+`;
const pattern = new RegExp(patternStr, 'g');

const doMatch = str => str.match(pattern);
console.log(doMatch("Hello how are you"));
console.log(doMatch("Hello *how*are_you_"));
// Note: "\*" in a JavaScript string literal is just "*"; write "\\*" for a literal backslash.
console.log(doMatch("*how*are _you_ \*doing*_today_ hm?"));

([-*_<>])(?:(?!\1).)+\1| |[^-*_<> ]+ means:

  • ([-*_<>])(?:(?!\1).)+\1 - First alternative:
    • ([-*_<>]) - Match and capture the initial delimiter
    • (?:(?!\1).)+ - Followed by one or more characters which are not that initial delimiter
    • \1 - Followed by that initial delimiter again
  • A literal space - Second alternative: match a single space
  • [^-*_<> ]+ - Third alternative: match one or more characters that are neither a delimiter nor a space
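
If the delimiter list is likely to change, the three alternatives above can be wrapped in a small factory. This is only a sketch: `escapeForClass` and `makeTokenizer` are hypothetical names, not anything built in. It escapes each delimiter for use inside a character class, so the dash no longer has to come first, and it returns an empty array instead of `null` when nothing matches.

```javascript
// Hypothetical helper: escape the characters that are special inside [...].
const escapeForClass = ch => ch.replace(/[\\\]^-]/g, '\\$&');

// Hypothetical factory: build a tokenizer for any delimiter string.
const makeTokenizer = delims => {
  const cls = [...delims].map(escapeForClass).join('');
  // Same three alternatives as above: delimited run | single space | plain run.
  const pattern = new RegExp(
    `([${cls}])(?:(?!\\1).)+\\1| |[^${cls} ]+`,
    'g'
  );
  // String.prototype.match returns null when there is no match; normalise to [].
  return str => str.match(pattern) || [];
};

const tokenize = makeTokenizer('*_-<>');
console.log(tokenize('Hello *how*are_you_'));
// [ 'Hello', ' ', '*how*', 'are', '_you_' ]
```

Because the pattern is built once per delimiter set, the returned function can be reused for every input string.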
CertainPerformance
  • Your answer looks to use pretty similar techniques to Wiktor's, except with the nice delims keyword to remove some repetition and his preserving the spaces. It also seems to work well, I just need to figure out _how_ it works now. Thanks for the explanation for how it works! Going to study this. – Ryan Peschel Sep 16 '19 at 08:06
  • Yeah I'm using WiktorStribiżew's regex but I appreciate this answer anyways because it helps explain it and is close enough. It also gave me the idea to use a delimiter variable to make it more readable. – Ryan Peschel Sep 16 '19 at 08:20
  • Ah, you wanted to capture the spaces - that's an easy tweak, just alternate with a space too – CertainPerformance Sep 16 '19 at 08:21

Try this (an improved version of CertainPerformance's answer):

let split = s => s.match(/([-*_<>])(?:(?!\1).)+\1| |[^ ]+/g) 

console.log(split("Hello how are you"));
console.log(split("Hello *how* are *you*"));
console.log(split("Hello *how*are_you_"));
console.log(split("*how*are _you_ \*doing*_today_ hm?"));
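
The only change from the earlier pattern is the last alternative, `[^ ]+` instead of `[^-*_<> ]+`. A small side-by-side sketch (the names `loose` and `strict` are just labels for this comparison) shows the trade-off: the looser version keeps an unpaired delimiter inside its word, but it also keeps a plain word glued to a delimited word that follows it without a space.

```javascript
// Last alternative is [^ ]+ : run up to the next space.
const loose = s => s.match(/([-*_<>])(?:(?!\1).)+\1| |[^ ]+/g) || [];
// Last alternative is [^-*_<> ]+ : run up to the next delimiter or space.
const strict = s => s.match(/([-*_<>])(?:(?!\1).)+\1| |[^-*_<> ]+/g) || [];

console.log(loose('Hello *how*are_you_'));  // [ 'Hello', ' ', '*how*', 'are_you_' ]
console.log(strict('Hello *how*are_you_')); // [ 'Hello', ' ', '*how*', 'are', '_you_' ]

// An unpaired delimiter is kept by the loose version and dropped by the strict one.
console.log(loose('a*b'));  // [ 'a*b' ]
console.log(strict('a*b')); // [ 'a', 'b' ]
```

Which behaviour is preferable depends on whether stray symbols must survive tokenizing or adjacent words must always be separated.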
Kamil Kiełczewski