How to split text into an array of URLs and space-separated phrases?

Question

I want to split a text based on URLs.

So a text like

const text = 'hello world, testing https://stackoverflow.com/questions/ask this is prefix https://gmail.com final text'

should give

const result = [
    'hello world, testing',
    'https://stackoverflow.com/questions/ask',
    'this is prefix',
    'https://gmail.com',
    'final text'
]

Basically, any URL should split the text, but the URL itself should also be included in the result.

I did try out a few things but was not able to create an algorithm for this.

/(http|https):\/\/[a-zA-Z0-9\-.]+\.[a-zA-Z]{2,3}(\/\S*)?/

I did try to split with this regex, but it's not consistent.

_I did try out a few things_ ... please provide what you have tried. — aca, Feb 07 '23 at 10:59
You can make the regex simpler by just targeting any non-whitespace characters, for example `/(https?:\/\/[^\s]+)/`. If you pass that regex to `string.split`, it will return the array you were expecting — Diego D, Feb 07 '23 at 11:01
If you want to split using regex to match any valid URL, try one of the solutions here https://stackoverflow.com/questions/161738/what-is-the-best-regular-expression-to-check-if-a-string-is-a-valid-url — Stitt, Feb 07 '23 at 11:02
Can it be: ["hello, world, testing", "https://...."], or "hello world" must go without comma between? — aca, Feb 07 '23 at 11:03
@DiegoD that's not a valid URI: `"https://"` per-se. You might unwantedly catch strings like `"use the scheme https:// instead"`. If you match it (in order to presumably construct a link) - that's a wrong approach. — Roko C. Buljan, Feb 07 '23 at 11:05
@RokoC.Buljan I agree but that string doesn't adhere to any formal language so actually there will always be ambiguity. An url when matching that pattern will be considered an url separator for the split function. Now we can decide to skip invalid urls ok... but yet I may argue that a valid url could be an escape sequence or not and there's no way to determine. Additionally the pattern I suggested expects at least one character following the schema (weak point I know) — Diego D, Feb 07 '23 at 11:07
@PradipShrestha I hope you "tested" and realized that the regex you're trying to use is not correct: https://regex101.com/r/s2CTBk/1 — Roko C. Buljan, Feb 07 '23 at 11:09
@DiegoD I did try to split with that regex, but `'https://stackoverflow.com/questions hello world https://stackoverflow.com/questions'.split(/((https|http)?:\/\/[^\s]+)/)` does not produce the correct output — Pradip Shrestha, Feb 07 '23 at 11:11
@DiegoD yeah, we can conclude that it highly depends on the use-case. If an exact URI pattern match is needed, by only doing `[^\s]+` - that's not enough. Also, there's the shorter `\S+`. — Roko C. Buljan, Feb 07 '23 at 11:14
@PradipShrestha My very dry regex clearly isn't foolproof, for several reasons, but it should still work for those scenarios, so I wonder why it doesn't produce the expected output. It was as easy as `text.split(/(https?:\/\/[^\s]+)/)`. I see you used a different pattern. Keep in mind that capturing groups (the parentheses) affect how the delimiter gets added to the result — Diego D, Feb 07 '23 at 11:14
@RokoC.Buljan Yes, and I strongly agree with your objections, of course. It was important to point out all the consequences of such an approach, and also the way you suggested to further simplify the `[^\s]`. That pattern is enough for the shown scenario but could easily fail when there is an `http://sdf` that is not intended as a URL. And in any case, the matched URL isn't guaranteed to be valid — Diego D, Feb 07 '23 at 11:16
I think this regex will fit my use case though. Thanks, @DiegoD and @RokoC.Buljan. I'll happily accept the answer if you post it — Pradip Shrestha, Feb 07 '23 at 11:18

score 5 · Accepted Answer · edited Feb 07 '23 at 13:08

You can use `.split()` with this regex, which contains a capture group:

\s*(https?:\/\/\S+)\s*

Code:

const text = 'hello world, testing https://stackoverflow.com/questions/ask this is prefix https://gmail.com final text';

var arr = text.trim().split(/\s*(https?:\/\/\S+)\s*/);

console.log(arr);

/*
['hello world, testing',
'https://stackoverflow.com/questions/ask',
'this is prefix',
'https://gmail.com',
'final text']
*/
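One edge case worth checking, sketched here under the assumption that empty strings are not wanted in the result: when the text starts or ends with a URL, or when two URLs are adjacent, `split` returns empty strings at those positions. The helper name `splitByUrls` and the `example` URLs are hypothetical; the regex is the one shown above.

```javascript
// Hypothetical helper: same regex as above, plus a filter for empty strings.
function splitByUrls(text) {
  return text
    .trim()
    .split(/\s*(https?:\/\/\S+)\s*/)
    .filter(Boolean); // drops the '' entries produced at the edges or between adjacent URLs
}

// Text that starts and ends with a URL:
const edges = 'https://a.example/x hello https://b.example';
console.log(edges.split(/\s*(https?:\/\/\S+)\s*/));
// [ '', 'https://a.example/x', 'hello', 'https://b.example', '' ]
console.log(splitByUrls(edges));
// [ 'https://a.example/x', 'hello', 'https://b.example' ]

// Two URLs next to each other:
console.log(splitByUrls('see https://a.example https://b.example now'));
// [ 'see', 'https://a.example', 'https://b.example', 'now' ]
```

For the text in the question, the filter changes nothing, because no empty strings are produced there.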

Breakdown of the regex:

\s*: Match 0 or more whitespace characters
(https?:\/\/\S+): Match any URL that starts with http:// or https:// followed by 1+ non-whitespace characters. Capture it in group #1 so that it is included in the resulting array.
\s*: Match 0 or more whitespace characters
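To illustrate why the capture group matters, here is a small sketch comparing a few variants of the pattern. The sample string and variable names are made up for illustration. It also shows why a pattern with a nested capture group, like the one tried in the comments under the question, adds extra entries: every capture group contributes its own element to the result.

```javascript
const sample = 'a https://x.example b';

// No capture group: the URL is used as a separator and then discarded.
const withoutGroup = sample.split(/https?:\/\/\S+/);
console.log(withoutGroup); // [ 'a ', ' b' ]

// One capture group: the matched URL is kept in the result.
const withGroup = sample.split(/(https?:\/\/\S+)/);
console.log(withGroup); // [ 'a ', 'https://x.example', ' b' ]

// Nested capture group: the inner group ('https') is added as well.
const nestedGroup = sample.split(/((https|http)?:\/\/\S+)/);
console.log(nestedGroup); // [ 'a ', 'https://x.example', 'https', ' b' ]

// A non-capturing inner group (?:...) avoids the extra entry.
const nonCapturing = sample.split(/((?:https|http):\/\/\S+)/);
console.log(nonCapturing); // [ 'a ', 'https://x.example', ' b' ]
```

The surrounding `\s*` in the answer's regex sits outside the capture group, so the whitespace around each URL is consumed by the separator and does not appear in any element.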

edited Feb 07 '23 at 13:08

Arvind Kumar Avinash

answered Feb 07 '23 at 11:18

anubhava

Why `\s*` instead of, e.g., `\b`? – Roko C. Buljan Feb 07 '23 at 11:21
@RokoC.Buljan: That is to eliminate trailing and leading whitespaces in resulting array from `split` – anubhava Feb 07 '23 at 11:23
That's exactly why I'm asking: https://jsfiddle.net/mvw351dk/ Am I missing something? – Roko C. Buljan Feb 07 '23 at 11:27
That's hilarious: I've been working in web development for over 8 years, during which time I went from a complete newbie to a fairly experienced developer, but I never knew that [using a capture group includes the separator in the resulting array](https://i.stack.imgur.com/OQQ09.png). Cool trick! – nicael Feb 07 '23 at 11:35
@RokoC.Buljan: Try this: https://jsfiddle.net/Lyxj86h4/ – anubhava Feb 07 '23 at 11:45
