I have a list with several text items in it. Some of these items have URLs and I want to extract only these URLs.

my_list = ['ok', 'thanks, here we go: https://www.example.com', 'http://example.org']

I want to show just the URL, for example:

my_new_list = ['https://www.example.com', 'http://example.org']

I managed to write a for loop that keeps only the items containing a URL, but the result still includes the rest of each item's text. For example: my_new_list = ['thanks, here we go: https://www.example.com']

Edit: To clarify, I want to do this without importing any modules.

2 Answers

An alternative to re, using a list comprehension:

for j in [element for element in my_list if "http" in element]:
    # Print every space-separated word that starts with "http"
    for k in j.split(" "):
        if k.startswith("http"):
            print(k)

Output:

https://www.example.com
http://example.org
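
Since the question asks for a new list rather than printed output, the same idea can be sketched as one nested list comprehension. This is only a minimal sketch: it assumes URLs are separated from the surrounding text by whitespace, and my_new_list is the name used in the question.

```python
my_list = ['ok', 'thanks, here we go: https://www.example.com', 'http://example.org']

# The outer loop walks the strings, the inner loop walks the
# whitespace-separated words; only words starting with "http" are kept.
my_new_list = [
    word
    for element in my_list
    for word in element.split()
    if word.startswith("http")
]

print(my_new_list)  # ['https://www.example.com', 'http://example.org']
```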
Julio Reckin
  • Yes! Thanks for your answer. Could you explain the logic behind it? Why is this part between brackets? `[element for element in my_list if "http" in element]` – interferemadly Aug 26 '21 at 17:34
  • With this syntax the result is a list of the elements that contain the string "http". The next step is to iterate over this list to extract only the URL part of each string. Here you can find further examples and explanations of list comprehensions: [link](https://python-reference.readthedocs.io/en/latest/docs/comprehensions/list_comprehension.html). If the answer helped you, feel free to give me an upvote. – Julio Reckin Aug 26 '21 at 17:44

If you want to use regex you can do the following:

import re
my_list = ['ok', 'thanks, here we go: https://www.example.com', 'http://example.org']
final_list = []
for my_string in my_list:
    final_list += re.findall(r"""(?i)\b((?:https?:(?:/{1,3}|[a-z0-9%])|[a-z0-9.\-]+[.](?:com|net|org|edu|gov|mil|aero|asia|biz|cat|coop|info|int|jobs|mobi|museum|name|post|pro|tel|travel|xxx|ac|ad|ae|af|ag|ai|al|am|an|ao|aq|ar|as|at|au|aw|ax|az|ba|bb|bd|be|bf|bg|bh|bi|bj|bm|bn|bo|br|bs|bt|bv|bw|by|bz|ca|cc|cd|cf|cg|ch|ci|ck|cl|cm|cn|co|cr|cs|cu|cv|cx|cy|cz|dd|de|dj|dk|dm|do|dz|ec|ee|eg|eh|er|es|et|eu|fi|fj|fk|fm|fo|fr|ga|gb|gd|ge|gf|gg|gh|gi|gl|gm|gn|gp|gq|gr|gs|gt|gu|gw|gy|hk|hm|hn|hr|ht|hu|id|ie|il|im|in|io|iq|ir|is|it|je|jm|jo|jp|ke|kg|kh|ki|km|kn|kp|kr|kw|ky|kz|la|lb|lc|li|lk|lr|ls|lt|lu|lv|ly|ma|mc|md|me|mg|mh|mk|ml|mm|mn|mo|mp|mq|mr|ms|mt|mu|mv|mw|mx|my|mz|na|nc|ne|nf|ng|ni|nl|no|np|nr|nu|nz|om|pa|pe|pf|pg|ph|pk|pl|pm|pn|pr|ps|pt|pw|py|qa|re|ro|rs|ru|rw|sa|sb|sc|sd|se|sg|sh|si|sj|Ja|sk|sl|sm|sn|so|sr|ss|st|su|sv|sx|sy|sz|tc|td|tf|tg|th|tj|tk|tl|tm|tn|to|tp|tr|tt|tv|tw|tz|ua|ug|uk|us|uy|uz|va|vc|ve|vg|vi|vn|vu|wf|ws|ye|yt|yu|za|zm|zw)/)(?:[^\s()<>{}\[\]]+|\([^\s()]*?\([^\s()]+\)[^\s()]*?\)|\([^\s]+?\))+(?:\([^\s()]*?\([^\s()]+\)[^\s()]*?\)|\([^\s]+?\)|[^\s`!()\[\]{};:'".,<>?«»“”‘’])|(?:(?<!@)[a-z0-9]+(?:[.\-][a-z0-9]+)*[.](?:com|net|org|edu|gov|mil|aero|asia|biz|cat|coop|info|int|jobs|mobi|museum|name|post|pro|tel|travel|xxx|ac|ad|ae|af|ag|ai|al|am|an|ao|aq|ar|as|at|au|aw|ax|az|ba|bb|bd|be|bf|bg|bh|bi|bj|bm|bn|bo|br|bs|bt|bv|bw|by|bz|ca|cc|cd|cf|cg|ch|ci|ck|cl|cm|cn|co|cr|cs|cu|cv|cx|cy|cz|dd|de|dj|dk|dm|do|dz|ec|ee|eg|eh|er|es|et|eu|fi|fj|fk|fm|fo|fr|ga|gb|gd|ge|gf|gg|gh|gi|gl|gm|gn|gp|gq|gr|gs|gt|gu|gw|gy|hk|hm|hn|hr|ht|hu|id|ie|il|im|in|io|iq|ir|is|it|je|jm|jo|jp|ke|kg|kh|ki|km|kn|kp|kr|kw|ky|kz|la|lb|lc|li|lk|lr|ls|lt|lu|lv|ly|ma|mc|md|me|mg|mh|mk|ml|mm|mn|mo|mp|mq|mr|ms|mt|mu|mv|mw|mx|my|mz|na|nc|ne|nf|ng|ni|nl|no|np|nr|nu|nz|om|pa|pe|pf|pg|ph|pk|pl|pm|pn|pr|ps|pt|pw|py|qa|re|ro|rs|ru|rw|sa|sb|sc|sd|se|sg|sh|si|sj|Ja|sk|sl|sm|sn|so|sr|ss|st|su|sv|sx|sy|sz|tc|td|tf|tg|th|tj|tk|tl|tm|tn|to|tp|tr|tt|tv|tw|tz|ua|ug|uk|us|uy|uz|va|vc|ve|vg|vi|vn|vu|wf|ws|ye|yt|yu|za|zm|zw)\b/?(?!@)))""", my_string)

print(final_list)

which gives you ['https://www.example.com', 'http://example.org']. The regex pattern is from here.
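
If the full pattern is more than you need, a much shorter one may be enough. The sketch below is not the pattern referenced above: it swaps in a simplified regex under the assumption that every URL starts with http:// or https:// and runs until the next whitespace character. The names simple_pattern and simple_list are only illustrative.

```python
import re

my_list = ['ok', 'thanks, here we go: https://www.example.com', 'http://example.org']

# Assumption: a URL begins with http:// or https:// and ends at whitespace.
simple_pattern = re.compile(r'https?://\S+')

simple_list = []
for my_string in my_list:
    # findall returns every non-overlapping match in the string
    simple_list += simple_pattern.findall(my_string)

print(simple_list)  # ['https://www.example.com', 'http://example.org']
```

Note that this simplified pattern would also capture punctuation directly attached to the end of a URL, which the longer pattern is designed to avoid.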

But since you specified you don't want to use any modules, you can do the following:

my_list = ['ok', 'thanks, here we go: https://www.example.com', 'http://example.org']
final_list = []
for my_string in my_list:
    if 'http' in my_string:
        final_list.append(my_string[my_string.find('http'):])

print(final_list)

This finds the index of 'http' in each string (if it is present) and keeps everything from that index onward. If the URL is not guaranteed to be at the end of each string, you can modify the code as follows:

my_list = ['ok', 'thanks, here we go: https://www.example.com', 'http://example.org is a great website']
final_list = []
for my_string in my_list:
    if 'http' in my_string:
        final_list.append(my_string[my_string.find('http'):].split()[0])

print(final_list)

which gives you the same output:

['https://www.example.com', 'http://example.org']
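
One more variation, still without modules, in case a string can hold several URLs, a URL is followed by punctuation, or the word "http" appears on its own. This is only a sketch under those assumptions; the helper name extract_urls and the sample data are made up for illustration.

```python
def extract_urls(items):
    """Return every whitespace-separated word that looks like a URL."""
    urls = []
    for text in items:
        for word in text.split():
            # Requiring the full scheme skips a bare word such as "http".
            if word.startswith(('http://', 'https://')):
                # Drop punctuation that commonly trails a URL in prose.
                urls.append(word.rstrip('.,;:!?)'))
    return urls


sample = [
    'ok',
    'see http://example.org, then https://www.example.com.',
    'http is a protocol',
]
print(extract_urls(sample))  # ['http://example.org', 'https://www.example.com']
```

Checking each word separately, instead of slicing from the first 'http', is what lets this pick up more than one URL per string.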
Paulina Khew