How to extract URLs from a string that doesn't contain https or www

Question

Consider a string

let a =  "I visit google.com often times but.. not amazon.uk"

How can I extract google.com and amazon.uk from the string above in JavaScript?

`[a-zA-Z0-9]+\.[a-zA-Z0-9]{2,}` might do the trick for most sites, but I'm strongly against relying on this approach alone; it's very inaccurate. You should capture the second part (the extension) as a group and test it against a [known list of TLDs](https://en.wikipedia.org/wiki/List_of_Internet_top-level_domains). Also, if you look at the RFC for domain names (I forgot the exact number), you will find that Unicode (not just the modern Latin alphabet) is valid. CMIIW. — Bagus Tesa, Jun 22 '22 at 14:56
This [Q&A on regexes for capturing URLs](https://stackoverflow.com/q/3809401) is a nice start. It would be best if you could: 1) check for valid TLDs; 2) check whether the site actually has a DNS record. — Bagus Tesa, Jun 22 '22 at 14:59
@Naveed Thanks for your solution, but it only works when the extension is .com or .uk. I want to capture all the URLs, even those with some other domain extension. — Lahfir, Jun 23 '22 at 08:58
@Lahfir, you can add those extensions here, delimited with a pipe, and it will work: (.uk|.com). — Naveed, Jun 23 '22 at 13:10
We would either need to know a pattern that identifies these as domains, or have a list of the domains we want to search for. The solution presented will work when you already have a list of identified domains, and it answers the question from that standpoint. — Naveed, Jun 23 '22 at 13:35
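The suggestion in the comments above (match a loose domain shape, then check the extension against a known TLD list) could look roughly like the sketch below. The `KNOWN_TLDS` set is a tiny illustrative subset and `extractDomains` is a hypothetical helper name; neither comes from the thread. It only handles ASCII labels, so internationalized domain names are not covered.

```javascript
// Illustrative subset only; a real list would be loaded from a full TLD registry.
const KNOWN_TLDS = new Set(["com", "uk", "io", "org", "net"]);

// Hypothetical helper: find dot-separated candidates and keep those
// whose final label is a known TLD.
function extractDomains(text) {
  // One or more "label." parts followed by an alphabetic final label (captured).
  const candidate = /\b(?:[a-z0-9](?:[a-z0-9-]*[a-z0-9])?\.)+([a-z]{2,})\b/gi;
  const found = [];
  for (const match of text.matchAll(candidate)) {
    if (KNOWN_TLDS.has(match[1].toLowerCase())) {
      found.push(match[0]);
    }
  }
  return found;
}

console.log(extractDomains("I visit google.com often times but.. not amazon.uk"));
// logs ["google.com", "amazon.uk"]
```

With this check, something like `readme.txt` is rejected because `txt` is not in the set, which is the main gain over the bare regex.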

score 0 · Answer 1 · answered Jun 22 '22 at 14:42

Try this:

let a =  "I visit google.com often times but.. not amazon.uk"
a.match(/("[^"]+"|[^"\s]+)/g);

Output:

[
    "I",
    "visit",
    "google.com",
    "often",
    "times",
    "but..",
    "not",
    "amazon.uk"
]
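Note that the output above is every token in the string, not only the domains. One possible follow-up, sketched here as an assumption and not as part of the original answer, is to filter the tokens with a loose "name.extension" test:

```javascript
const a = "I visit google.com often times but.. not amazon.uk";

// Same tokenizing regex as above: quoted phrases or runs of non-space characters.
const tokens = a.match(/("[^"]+"|[^"\s]+)/g) ?? [];

// Keep only tokens shaped like name.extension (a deliberately loose, assumed test).
const domainLike = tokens.filter((t) =>
  /^[a-z0-9-]+(\.[a-z0-9-]+)*\.[a-z]{2,}$/i.test(t)
);

console.log(domainLike); // logs ["google.com", "amazon.uk"]
```

Because the test is anchored to the whole token, a domain followed directly by punctuation (for example `google.com,`) would be dropped; trimming trailing punctuation first would be needed for that case.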

yanir midler

Thanks for the answer but what if there is a domain with some other extension .io or something? Do you suggest to store the list of extensions in an array and compare with that? – Lahfir Jun 22 '22 at 14:48
I think you need to write a custom parser for it – Shkar Sardar Jun 22 '22 at 14:50

Naveed · Answer 2 · 2022-06-22T21:19:10.000

0

Here is one way to do it

\s(\w+)(\.uk|\.com)\b

Here is a fiddle link for JavaScript:

https://jsfiddle.net/y25wz3ae/

https://regex101.com/r/HFyxEJ/1

Result [('google', '.com'), ('amazon', '.uk')]
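To support more extensions without editing the pattern by hand, the alternation can be built from a list. This is only a sketch: the `extensions` array and the `escapeRegExp` helper are assumptions for illustration, and it uses `\b` in place of the leading `\s` so that a domain at the very start of the string is also matched.

```javascript
// Hypothetical list of extensions to accept; extend as needed.
const extensions = [".com", ".uk", ".io"];

// Escape regex metacharacters so "." matches a literal dot.
const escapeRegExp = (s) => s.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");

// Builds the equivalent of /\b(\w+)(\.com|\.uk|\.io)\b/g
const pattern = new RegExp(
  `\\b(\\w+)(${extensions.map(escapeRegExp).join("|")})\\b`,
  "g"
);

const a = "I visit google.com often times but.. not amazon.uk";

// Collect [name, extension] pairs from every match.
const pairs = [...a.matchAll(pattern)].map((m) => [m[1], m[2]]);

console.log(pairs); // logs [["google", ".com"], ["amazon", ".uk"]]
```

Escaping matters here: an unescaped `.` matches any character, so an extension like `.com` would also match text such as `xcom`.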

edited Jun 22 '22 at 21:19

answered Jun 22 '22 at 19:35

score -1 · Answer 3 · answered Jun 22 '22 at 19:14

To solve this problem I've created an API to extract URLs from a string or an array of strings

Base Url -> https://urlsparser.herokuapp.com/

GET https://urlsparser.herokuapp.com/url

For a single string

{
  "string" : "More here http://action.mySite.com/trk.php?mclic=P4CAB9542D7F151&urlrv=http%3A%2F%2Fjeu-centerparcs.com%2F%23%21%2F%3Fidfrom%3D8&urlv=517b975385e89dfb8b9689e6c2b4b93d text<br/>And more here http://action.mySite.com/trk.php?mclic=P4CAB9542D7F151&urlrv=http%3A%2F%2Fjeu-centerparcs.com%2F%23%21%2F%3Fidfrom%3D8&urlv=517b975385e89dfb8b9689e6c2b4b93d"
}

For an array of strings

{
  "string" : ["string1","string2"....]
}

Advantages

Supports more than 900 domain extensions [.com,.io,....]
Fast: extracts results in less than 20 ms
