Find web links without elements in plain text using a Perl regex or module

Question

My input file is plain text:

Fries Scheepvaartmuseum: Schiffmodelle in jeglichen Größen und viele Infos über Schiffsbau und Seefahrt sowie über die Geschichte der Stadt Sneek. *www.friesscheepvaartmuseum.nl** Museen sowie facebook.com viele kleine Gassen zwischen den https://facebook.com Grachten locken zu Erkundungstouren. Der Strand lädt zu romantischen Spaziergängen ein https://stackoverflow.com/questions/tagged/perl nicht nur probieren und kaufen, sondern auch das nostalgische Haus und die Destillerie besichtigen stackoverflow.com/questions/tagged/perl

I am able to find www.<sample>.<edu|com|af|ag|ai|al|etc> and https?://<sample>.<edu|com|af|ag|ai|al|etc>, that is, links with a prefix (www, http) and a suffix (one of a list of domains).

However, I need to find links based on a list of domains (... .edu, .com, .af, .ag, .ai, .al) even when they have no such prefix and do not end at the domain.

For example:

I am unable to find incomplete links in plain text, that is, links without a www, https, or http prefix, such as facebook.com or stackoverflow.com/questions/tagged/perl.

Could someone please help me with this? A module or a regex pattern would be helpful, since I have more than 10k web links to find.

Someone downvoted this without giving a reason... how can I improve? — ssr1012, Feb 03 '20 at 13:12
See also [Using regex to extract URLs from plain text with Perl](https://stackoverflow.com/q/1053349/2173773) — Håkon Hægland, Feb 03 '20 at 14:21
@HåkonHægland: Thanks for the link. I checked it, but I need to fetch links without `http`, `www`, etc. Could you please advise? — ssr1012, Feb 04 '20 at 06:05
Can you give an example of the expected output for some sample text? That would help clarify what you want. — Håkon Hægland, Feb 04 '20 at 07:41
`facebook.com` should be ` facebook.com`, e.g. facebook. In production there may be 10k domains in the files. — ssr1012, Feb 04 '20 at 07:45
What do you mean by *"I need to find the links ... without suffix"*? What is meant by suffix? — Håkon Hægland, Feb 04 '20 at 07:59
Links that do not end with one of the listed domains such as `.com`, `.edu`, `.org`, etc. For example, `stackoverflow.com/something/blahblah`. — ssr1012, Feb 04 '20 at 08:47
How can you know what prefix to insert? For example `facebook.com` and `stackoverflow.com/questions`. How does the script know that `facebook.com` should have `www` prefix, and `stackoverflow.com` should **not** have `www` prefix? — Håkon Hægland, Feb 04 '20 at 10:19
We have a list of rules for adding `http`, `https`, or `www`. We can apply them later, once we have fetched the strings, e.g. `facebook.com`, `stackoverflow.com`, and `yahoo.co.in`. — ssr1012, Feb 04 '20 at 10:40
Can you use [URI::Find::Schemeless](https://metacpan.org/pod/URI::Find::Schemeless) ? — Håkon Hægland, Feb 04 '20 at 10:48
I have already looked at that, but I don't know where to find the `&callback` function in `URI::Find::Schemeless`. — ssr1012, Feb 04 '20 at 11:25
The callback is described in the documentation for [URI::Find](https://metacpan.org/pod/URI::Find) — Håkon Hægland, Feb 04 '20 at 12:22
Could you please provide a sample `callback`? My sincere apologies, I was not able to understand that description. — ssr1012, Feb 04 '20 at 13:53

score 1 · Accepted Answer · answered Feb 04 '20 at 14:06

Here is an example using URI::Find::Schemeless:

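use feature qw(say);
use strict;
use warnings;
use URI::Find::Schemeless;

# Slurp the sample text from the __DATA__ section below.
my $text = do { local $/; <DATA> };
my $finder = URI::Find::Schemeless->new(\&callback);
# find() calls the callback once per URI and returns the number of URIs found.
my $how_many_found = $finder->find(\$text);

# Called with a URI object and the text exactly as it appeared in the input.
sub callback {
    my ( $uri, $orig_text ) = @_;
    say "Found: ", $orig_text;
    # The return value replaces the match in $text, so return it unchanged.
    return $orig_text;
}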

__DATA__
Fries Scheepvaartmuseum: Schiffmodelle in jeglichen Größen und viele Infos über Schiffsbau und Seefahrt sowie über die Geschichte der Stadt Sneek. *www.friesscheepvaartmuseum.nl** Museen sowie facebook.com viele kleine Gassen zwischen den https://facebook.com Grachten locken zu Erkundungstouren. Der Strand lädt zu romantischen Spaziergängen ein https://stackoverflow.com/questions/tagged/perl nicht nur probieren und kaufen, sondern auch das nostalgische Haus und die Destillerie besichtigen stackoverflow.com/questions/tagged/perl

Output:

Found: facebook.com
Found: https://facebook.com
Found: https://stackoverflow.com/questions/tagged/perl
Found: stackoverflow.com/questions/tagged/perl
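
For comparison, since the question also asks about a plain regex, here is a rough regex-only sketch that needs no CPAN module. It is not what `URI::Find::Schemeless` does internally. The `@tlds` list is a short hypothetical stand-in for the real domain list, `find_links` is a made-up helper name, and the pattern is a simplification that will miss or mis-match some edge cases.

```perl
use strict;
use warnings;

# Hypothetical, deliberately short domain list; the real one would be much longer.
my @tlds   = qw(com edu org nl in);
my $tld_re = join '|', map { quotemeta } sort { length($b) <=> length($a) } @tlds;

# Optional scheme, one or more dot-terminated labels, a listed domain, optional path.
my $link_re = qr{
    (?<![\w\@.-])                                   # do not start inside a word, host name, or e-mail address
    (?: https?:// )?                                # optional scheme
    (?: [a-z0-9] (?: [a-z0-9-]* [a-z0-9] )? \. )+   # host labels, each followed by a dot
    (?: $tld_re )                                   # last label must be in the domain list
    (?! [\w-] )                                     # the domain must end here (rejects "company.computer")
    (?: / [^\s<>"']* )?                             # optional path up to the next whitespace
}xi;

# Return every link-like string found in the given text, in order of appearance.
sub find_links {
    my ($text) = @_;
    my @found;
    while ( $text =~ /($link_re)/g ) {
        ( my $link = $1 ) =~ s/[.,;:!?)*]+\z//;     # drop trailing punctuation picked up by the path
        push @found, $link;
    }
    return @found;
}

print "Found: $_\n" for find_links('see facebook.com or stackoverflow.com/questions/tagged/perl');
```

Because the matches are returned as a list, any rules for adding `http`, `https`, or `www` can be applied afterwards. For anything beyond a quick filter, the module above is likely the more robust choice.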
