Check whether text contains a URL and, if so, wrap it in an href tag using Python

Question

I have two cases where I am stuck.

CASE 1) input :- This is a sample text. http://sydney.edu.au/ somthing else text.

Required output :- This is a sample text. < a href="http://sydney.edu.au/">http://sydney.edu.au/< /a > somthing else text.

CASE 2) input :- This is a sample text. sydney.edu.au/ somthing else text.

Required output :- This is a sample text. < a href="sydney.edu.au/">sydney.edu.au/< /a > somthing else text.

I have tried the following piece of code:

>>> import re
>>> item = "< p >This is a sample text. http://sydney.edu.au/ somthing else text.< /p >"
>>> r = re.compile(r"(https?://[^ ]+)")
>>> newstr = r.sub(r'<a href="\1">\1</a>', item)

This gives me the required output for CASE 1 but not for CASE 2. Can anyone suggest a way to handle both cases?

What do you want to consider to create a match in case 2? .edu.au domains? anything with first.second.tld? just first.tld? — MatsLindh, Jul 07 '14 at 09:21

score 0 · Accepted Answer · edited May 23 '17 at 11:57

Your url matching regex seems to be invalid. You can check here for the regex.

If you want the whole "https://" prefix to be optional, you need to wrap it in parentheses and put the "?" after the group, i.e. (https://)?. Otherwise the "?" only makes the 's' in https optional.

Since your regex requires the "http://" or "https://" prefix and CASE 2 does not have one, it fails for CASE 2.
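
A minimal sketch of the difference described above, using only the standard `re` module. The variable names `scheme_required` and `scheme_optional` are made up for this illustration, and the second pattern is a deliberately simple one, not a general URL matcher.

```python
import re

text = "This is a sample text. sydney.edu.au/ somthing else text."

# In "https?://" the "?" applies only to the "s", so "http://" is still required.
scheme_required = re.compile(r"https?://[^ ]+")

# Wrapping the whole prefix in a (non-capturing) group and adding "?" makes
# the entire scheme optional.
scheme_optional = re.compile(r"(?:https?://)?[a-zA-Z]+\.[a-z]+[^ ]*")

print(scheme_required.search(text))          # no match: the text has no scheme
print(scheme_optional.search(text).group())  # the bare domain is found
```

The same optional-scheme pattern still matches the CASE 1 input, because the group is consumed when the prefix is present.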

EDIT: A regex that matches every type of URL is very difficult to write, and even to understand.

The following regex is a simple one and works for both cases.

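import re

# Sample input; the URL has no scheme (CASE 2).
text = "< p >This is a sample text. sydney.edu.au/ somthing else text.< /p >"
# Optional scheme, optional "www.", then name.tld followed by any non-space characters.
regex = r"((?:https?://)?(?:www\.)?[a-zA-Z]+\.[a-z]+[^ ]*)"
print(re.sub(regex, r'<a href="\1">\1</a>', text))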

Community

answered Jul 07 '14 at 09:19

user2109788

Can you please guide me on CASE 2? I am unable to understand the regex link that you shared. I am looking for a regex that can detect a URL like the one in CASE 2, i.e. a URL without "http" / "https" / "www". – Tanveer Alam Jul 07 '14 at 09:30
@TanveerAlam: I have updated the answer. If you still can't understand the regex mentioned there, I will explain it. Just try it out first. – user2109788 Jul 07 '14 at 09:55
@user2109788: Thanks for updating the answer. I tried it with my input, but the regex also changes my < img > tag. Here is a sample: < img src="/test/images/satellite_online.jpg"/> – Tanveer Alam Jul 07 '14 at 10:07
Can you edit the code to handle such unwanted results? – Tanveer Alam Jul 07 '14 at 10:11
@TanveerAlam: Try prepending a blank space to the regex, so that it looks like this: regex = " ((?:https?://)?(?:www\.)?[a-zA-Z]+\.[a-z]+[^ ]*)" – user2109788 Jul 07 '14 at 10:21
Thanks again, but another case shows up after running the script: sydney.edu.au/ /> The regular expression is also matching the < br > tag. Can we process only the text inside the HTML tags, so that the tags themselves are not added to the href? – Tanveer Alam Jul 07 '14 at 10:28
@TanveerAlam: Your example is not well formed: the string (double quote) closes before the end of the br tag. This can be handled with a regex, but it is very difficult. What is your input, and what are you going to do with it? If you want to extract useful content from a web page, it is better to use a module like BeautifulSoup, which provides much simpler solutions to such problems. You can find it here: http://www.crummy.com/software/BeautifulSoup/bs4/doc/ and you can google it for more info on Beautiful Soup. – user2109788 Jul 07 '14 at 11:24
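
As a rough illustration of keeping tags out of the match, here is a sketch that uses only the standard `re` module rather than BeautifulSoup. It assumes reasonably well-formed markup with no ">" inside attribute values; the names `TAG_RE`, `URL_RE` and `linkify_text_only` are hypothetical, and the URL pattern is adapted from the ones in this thread (without the leading space).

```python
import re

# Simple URL pattern adapted from this thread; it is not a complete URL matcher.
URL_RE = re.compile(r"((?:https?://)?(?:www\.)?[a-zA-Z]+\.[a-z]{2,3}/?[^ <]*)")
# Naive tag pattern: "<", anything except ">", then ">".
TAG_RE = re.compile(r"(<[^>]*>)")

def linkify_text_only(html):
    # Because TAG_RE has a capturing group, split() keeps the tags:
    # even indices hold text between tags, odd indices hold the tags.
    parts = TAG_RE.split(html)
    for i in range(0, len(parts), 2):
        parts[i] = URL_RE.sub(r'<a href="\1">\1</a>', parts[i])
    return "".join(parts)

sample = '<p>Visit sydney.edu.au/ today.<br/><img src="/test/images/satellite_online.jpg"/></p>'
print(linkify_text_only(sample))
```

Only the text segments are rewritten, so the src attribute of the img tag is left alone. For real-world HTML a proper parser is still the safer choice.
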
I have HTML content and I want to wrap the URLs like this: < a href="URL">URL< /a >. This is my basic requirement. But when I apply the code, it also wraps the HTML tags inside the < a > tag. So I tried to extract only the text from the HTML contents of each tag, but I couldn't succeed. – Tanveer Alam Jul 07 '14 at 11:28
Can you post a case where it fails? Before that, modify your regex to this: regex = " ((?:https?://)?(?:www\.)?[a-zA-Z]+\.[a-z]{2,3}/?[^ ]*)" – user2109788 Jul 07 '14 at 11:47
In the sacred offerings, sydney.edu.au/
. This is the case where it is failing and giving output as sydney.edu.au/
/> – Tanveer Alam Jul 07 '14 at 12:06
Modify the regex to this and try: regex = " ((?:https?://)?(?:www\.)?[a-zA-Z]+\.[a-z]{2,3}/?[^ <]*)" I hope you know regex. You can modify this yourself for the simple cases, but you need to know the regex constructs. – user2109788 Jul 07 '14 at 12:47
Sorry sir, but I don't know much about regular expressions, so I am unable to handle even the simple cases. – Tanveer Alam Jul 09 '14 at 05:12
Can you please update the regex to handle all types of URLs? The pattern you gave me does not work when there is no space before the URL. It also includes trailing punctuation in the href when punctuation follows the URL, so it would be very useful if you could handle that too. – Tanveer Alam Jul 09 '14 at 05:16
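
One possible way to approach the two problems raised in the last comment, sketched with the standard `re` module only. It is not a complete URL matcher (it will also match things like file names), and the names `URL_RE`, `TRAILING`, `wrap` and `linkify` are hypothetical. A lookbehind replaces the required leading space, and trailing punctuation is moved outside the link in a replacement function.

```python
import re

# The lookbehind rejects matches that start in the middle of a word, path or
# e-mail address, so no leading space is needed before the URL.
URL_RE = re.compile(
    r"(?<![\w/.@])(?:https?://)?(?:[a-zA-Z0-9-]+\.)+[a-z]{2,}(?:/[^\s<]*)?"
)
# Characters assumed to be sentence punctuation when they end a match.
TRAILING = ".,;:!?)"

def wrap(match):
    url = match.group(0)
    stripped = url.rstrip(TRAILING)
    tail = url[len(stripped):]  # punctuation that stays outside the link
    return '<a href="{0}">{0}</a>{1}'.format(stripped, tail)

def linkify(text):
    return URL_RE.sub(wrap, text)

print(linkify("See(sydney.edu.au/about), or http://sydney.edu.au/."))
```

Stripping every trailing ")" is a simplification: a URL that legitimately ends with a closing parenthesis would be cut short.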
