
I have two cases where I am stuck.

CASE 1) input :- < p >This is a sample text. http://sydney.edu.au/ somthing else text.< /p >

Required output :- < p >This is a sample text. < a href="http://sydney.edu.au/">http://sydney.edu.au/< /a > somthing else text.< /p >

CASE 2) input :- < p >This is a sample text. sydney.edu.au/ somthing else text.< /p >

Required output :- < p >This is a sample text. < a href="sydney.edu.au/">sydney.edu.au/< /a > somthing else text.< /p >

I have tried the following piece of code:

>>> item = "< p >This is a sample text. http://sydney.edu.au/ somthing else text.< /p >"
>>> import re
>>> r = re.compile(r"(https?://[^ ]+)")
>>> newstr = r.sub(r'<a href="\1">\1</a>', item)

This gives me the required output for CASE 1 but not for CASE 2. Is there a way to handle both cases?

Tanveer Alam
  • What do you want to consider to create a match in case 2? .edu.au domains? anything with first.second.tld? just first.tld? – MatsLindh Jul 07 '14 at 09:21

1 Answer


Your URL-matching regex does not cover CASE 2. You can check here for the regex.

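If you want the whole "https://" prefix to be optional, you need to wrap it in parentheses and put "?" after the group, i.e. (https://)?, or (https?://)? to keep plain "http://" matching as well. Otherwise the "?" only makes the 's' in https optional.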

Since the regex requires http:// or https:// and CASE 2 has no scheme, it fails for CASE 2.
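To illustrate the difference, here is a small sketch comparing a pattern where only the 's' is optional with one where the whole scheme is optional. The variable names are only illustrative, and the domain part (letters, a dot, letters) is a simplifying assumption, not a complete URL grammar.

```python
import re

with_scheme = "http://sydney.edu.au/"
without_scheme = "sydney.edu.au/"

# In "https?" the "?" applies only to the preceding "s".
scheme_required = re.compile(r"https?://[^ ]+")
# Wrapping the scheme in a non-capturing group makes all of it optional.
# The domain part (letters, dot, letters) is an assumption for this sketch.
scheme_optional = re.compile(r"(?:https?://)?[a-zA-Z]+\.[a-z]+[^ ]*")

required_hits = [bool(scheme_required.search(s)) for s in (with_scheme, without_scheme)]
optional_hits = [bool(scheme_optional.search(s)) for s in (with_scheme, without_scheme)]
print(required_hits)  # [True, False]
print(optional_hits)  # [True, True]
```

Making the scheme optional is not enough on its own: the rest of the pattern still has to be specific enough to recognise a domain, otherwise ordinary words would match too.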

EDIT: A regex that matches every kind of URL is very difficult to write, and even to read.

The following regex is a simple one and works for both cases.

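import re

item = "< p >This is a sample text. sydney.edu.au/ somthing else text.< /p >"
# Optional scheme, optional "www.", then name.tld followed by any non-space characters.
regex = r"((?:https?://)?(?:www\.)?[a-zA-Z]+\.[a-z]+[^ ]*)"
# Quote the href value so the generated attribute is well formed.
print(re.sub(regex, r'<a href="\1">\1</a>', item))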
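As a quick check, the same pattern can be wrapped in a small helper and run against both inputs from the question. This is only a sketch: `linkify` and `URL_RE` are hypothetical names, and the href is quoted here to match the required output.

```python
import re

# Hypothetical helper; the pattern is the simple one from this answer,
# not a complete URL grammar.
URL_RE = re.compile(r"((?:https?://)?(?:www\.)?[a-zA-Z]+\.[a-z]+[^ ]*)")


def linkify(text):
    """Wrap each URL-like token in an anchor tag with a quoted href."""
    return URL_RE.sub(r'<a href="\1">\1</a>', text)


case1 = "< p >This is a sample text. http://sydney.edu.au/ somthing else text.< /p >"
case2 = "< p >This is a sample text. sydney.edu.au/ somthing else text.< /p >"
print(linkify(case1))
print(linkify(case2))
```

Two caveats: browsers treat a scheme-less href such as "sydney.edu.au/" as a relative link, so you may want to prepend "http://" when no scheme is present. Also, because the match runs up to the next space, a URL immediately followed by a tag with no space in between would take the tag with it.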
user2109788