Regex to match URLs with and without domain name in python?

Question

This is the input sample:

<!DOCTYPE html>
<meta property="og:image" content="http://www.mydomainname.com/pathout_themes/pathology_outlines/images/logo_for_social.jpg" />
<meta property="og:image" content="www.mydomainname.com/pathout_themes/pathology_outlines/images/logo_for_social.jpg" />
<meta property="og:image" content="mydomainname.com/pathout_themes/pathology_outlines/images/logo_for_social.jpg" />
<meta property="og:image" content="/pathout_themes/pathology_outlines/images/logo_for_social.jpg" />
<meta property="og:image" content="https://mydomainname.com/pathout_themes/pathology_outlines/images/logo_for_social.jpg" />
<meta property="og:image" content="http://mydomainname.com/pathout_themes/pathology_outlines/images/logo_for_social.jpg" />
<link rel="preconnect" href="https://cdn.ncbi.nlm.nih.gov">
  <link rel="preconnect" href="https://www.ncbi.nlm.nih.gov">
  <link rel="preconnect" href="https://www.google-analytics.com">
    <link rel="stylesheet" href="https://cdn.ncbi.nlm.nih.gov/pubmed/c6188713-b612-4503-95e5-39fd91b4d5be/CACHE/css/output.35e8b192ea09.css" type="text/css">
  <link rel="stylesheet" href="https://cdn.ncbi.nlm.nih.gov/pubmed/c6188713-b612-4503-95e5-39fd91b4d5be/CACHE/css/output.452c70ce66f7.css" type="text/css">
    <meta property="og:image" content="https://www.mydomainname.com/pathout_themes/pathology_outlines/images/logo_for_social.jpg" />
    <meta property="og:title" content="MyDomainName.com"/>
    <meta property="og:url" content="https://www.mydomainname.com/"/>
    <meta property="og:description" content="MyDomainName.com, free, updated outline surgical domain clinical gjkjjhkl lkjkhkj jobs, conferences, fellowships, books" />        
    <link rel="apple-touch-icon" href="https://www.mydomainname.com/apple-touch-icon.png">
    <link rel="icon" sizes="180x180" href="https://www.mydomainname.com/apple-touch-icon.png">
    <link rel="shortcut icon" href="https://www.mydomainname.com/pathout_themes/pathology_outlines/favicon.ico" />
    <!--<link href="https://www.mydomainname.com/pathout_themes/pathology_outlines/css/reset.css" rel="stylesheet" type="text/css" />-->
    <link rel="stylesheet" href="https://www.mydomainname.com/pathout_core/plugins/font-awesome-4.6.3/css/font-awesome.min.css" type="text/css" /><!-- font awesome -->
    <!--<link href='http://fonts.googleapis.com/css?family=Lato:400,700,700italic,400italic,900,900italic' rel='stylesheet' type='text/css'>-->
    <link href="https://www.mydomainname.com/pathout_core/plugins/lightbox/css/lightbox.css" rel="stylesheet" />
    <link href="https://www.mydomainname.com/pathout_themes/pathology_outlines/css/main.css?ver=1.3.4" rel="stylesheet" />
            <!--<script id="Cookiebot" src="https://consent.cookiebot.com/uc.js" data-cbid="8539fe38-38ce-448a-beb8-edb691018d2d" data-blockingmode="auto" type="text/javascript"></script>-->
        <img src="https://www.mydomainname.com/pathout_themes/pathology_outlines/images/ntbg3.png" width="1" height="1" alt="Image 01" />
        <img src="https://www.mydomainname.com/pathout_themes/pathology_outlines/images/loader-sprite.png" width="1" height="1" alt="Image 02" />
    <script async defer src="https://scripts.simpleanalyticscdn.com/latest.js"></script>
    <noscript><img src="https://queue.simpleanalyticscdn.com/noscript.gif" alt="" referrerpolicy="no-referrer-when-downgrade" /></noscript>
    <!--<script async defer src="https://scripts.simpleanalyticscdn.com/auto-events.js"></script>-->
    <script async src="https://scripts.simpleanalyticscdn.com/auto-events.js" /></body>
</html>

It is the HTML source of a page, which I modified to include more links.

I want to match, in Python using regex, all the URLs that start with mydomainname.com, and also those with no domain name specified, such as /pathout_themes/pathology_outlines/images/logo_for_social.jpg

I started from this:

(https?:\/\/)|(www\.)?[-a-zA-Z0-9@:%._\+~#=]{1,256}\.[a-zA-Z0-9()]{1,6}\b([-a-zA-Z0-9()@:%_\+.~#?&//=]*)

that matches all URLs (no matter the domain name), but it doesn't match URLs without a domain name, like the example given above. This one does basically the same thing:

(?:(http|ftp|https):\/\/)?[\w-]+(\.[\w-]+)+([\w.,@?^=%&;:\/~+#-]*[\w@?^=%&;\/~+#-])?

Then I was trying this:

(https://www.mydomainname.com/)|(w-)(/)?

but it does not match full URLs; it matches only the domain name. It also doesn't match URLs without a domain name specified, like the one above.

I need a regex that matches all URLs, both with and without a domain name specified.

I have already read the following topics:

Regex match the Domain name

Need a regex to match URL (without http, https, ftp)

help rewriting this url. simple but not working

Unfortunately none of them is what I need.

@WiktorStribiżew I need to use this regex in Python so that is why I tested it on pythex.org and it is not working as expected because: 1. It matches also content= or href= or src= too. 2. It matches all urls, which is good, but I need only urls starting with mydomainname.com (+/-http(s) and/or www) and also urls that don't have any domain name. e.g. src="/mypath/dir/img.jpg" 3. What if urls is inside the source code but it doesn't have content, href or src before it? Thank you in advance! — YoYoYo, Sep 27 '22 at 11:20

score 2 · Accepted Answer · answered Sep 27 '22 at 12:32

I think you're on the right track. I basically used your original regex to match the patterns including the domain name. For those with a relative URL, I chose to require quotes around the pathname and require that the pathname start with /. You'll have to look at your data to decide if those restrictions make sense for you or not. For example, you can remove the "quote" groups, but you will match any content like /jk in your input.

import re

# Absolute URLs on the domain (this pattern requires a scheme and "www.")
domainname_re = r'(?P<match>(http|https|ftp)://www.mydomainname.com[a-zA-Z0-9@:%\._\+\/\&~#=]*)'
# Relative URLs: a quoted path that starts with "/"
nondomainname_re = r'(?P<quote>[\'"])(?P<match2>\/[a-zA-Z0-9@:%\._\+~#=\/\&]*)(?P=quote)'
all_re = '(' + domainname_re + '|' + nondomainname_re + ')'

# to test on your test input (testinput is a string holding the HTML from the question):
for match in re.finditer(all_re, testinput):
  matchdict = match.groupdict()
  print(matchdict['match'] or matchdict['match2'])
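
To illustrate the trade-off around the quotes, here is a small sketch comparing the relative-path pattern with and without the surrounding quote groups. The sample string and the variable names are made up for this illustration; the character class is the one used above.

```python
import re

# Path part of the relative-URL pattern: a "/" followed by URL characters.
PATH = r"/[a-zA-Z0-9@:%._+/&~#=]*"

# With the quote groups: the path must sit between matching quotes.
quoted = re.compile(r'(?P<quote>[\'"])(?P<path>' + PATH + r')(?P=quote)')
# Without the quote groups: anything that starts with "/" matches.
unquoted = re.compile(PATH)

# Hypothetical sample, not taken from the question's input.
sample = '<img src="/img/a.png" alt="x" /></body>'

quoted_hits = [m.group("path") for m in quoted.finditer(sample)]
unquoted_hits = unquoted.findall(sample)

print(quoted_hits)    # ['/img/a.png']
print(unquoted_hits)  # ['/img/a.png', '/', '/body']
```

Without the quotes, the self-closing "/>" and the closing tag "</body>" both produce matches, which is the kind of noise the quote requirement filters out.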

To be clear, the final pattern I used is:

'((?P<match>(http|https|ftp)://www.mydomainname.com[a-zA-Z0-9@:%\._\+\/\&~#=]*)|(?P<quote>['"])(?P<match2>\/[a-zA-Z0-9@:%\._\+~#=\/\&]*)(?P=quote))'
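
Note that this pattern requires both a scheme and "www.", while the question also lists forms such as www.mydomainname.com/... and mydomainname.com/.... A minimal sketch of one way to make those parts optional is below. It is not the pattern from this answer: the names URL_CHARS, DOMAIN_RE, RELATIVE_RE and find_urls are made up here, the lookbehind and lookahead around the domain are an added assumption to avoid matching inside longer host names, and "?" and "-" are added to the character class so that query strings and hyphenated paths are captured whole.

```python
import re

# The answer's character class plus "?" and "-" (an assumption made for this sketch).
URL_CHARS = r"[a-zA-Z0-9@:%._+/&~#=?-]*"

# Optional scheme, optional "www.", then the domain itself.
# (?<![\w.-]) rejects hosts like "cdn.mydomainname.com" or "notmydomainname.com";
# (?![\w.-]) rejects hosts like "mydomainname.com.example.org".
DOMAIN_RE = (
    r"(?<![\w.-])(?P<match>(?:(?:https?|ftp)://)?(?:www\.)?"
    r"mydomainname\.com(?![\w.-])" + URL_CHARS + r")"
)

# Relative URLs: a quoted path starting with "/", as in the answer.
RELATIVE_RE = r'(?P<quote>[\'"])(?P<match2>/' + URL_CHARS + r')(?P=quote)'

ALL_RE = re.compile(DOMAIN_RE + "|" + RELATIVE_RE)


def find_urls(html):
    """Return the domain URLs and quoted relative paths found in html."""
    return [m.group("match") or m.group("match2") for m in ALL_RE.finditer(html)]


# Shortened, hypothetical variants of the lines in the question's input.
sample = '''
<meta content="http://www.mydomainname.com/images/logo.jpg" />
<meta content="www.mydomainname.com/images/logo.jpg" />
<meta content="mydomainname.com/images/logo.jpg" />
<meta content="/images/logo.jpg" />
<link href="https://cdn.ncbi.nlm.nih.gov">
<link href="https://www.mydomainname.com/css/main.css?ver=1.3.4" />
'''

for url in find_urls(sample):
    print(url)
```

On this sample the cdn.ncbi.nlm.nih.gov link is skipped and the other five values are printed. The match is case-sensitive, so text such as MyDomainName.com would need re.IGNORECASE to be picked up.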

Hopefully that helps you find your next step!

This is what I need and I think you are right about the quotes. Thank you! P.S. I will mark this as solution. — YoYoYo, Sep 27 '22 at 12:45
