python3.6 How do I regex a url from a .txt?

Question

I need to grab a url from a text file.

The URL is stored in a string like so: 'URL=http://example.net'.

Is there any way I could grab everything after the = character up to the . in '.net'?

Could I use the re module?

Hey, what do you know. That worked. Thanks for the help my friend! — passwordhash, Sep 14 '19 at 16:13

score 0 · Answer 1 · answered Sep 14 '19 at 16:12

I don't have much information, but I'll try to help with what I've got. I'm assuming that URL= is part of the string; in that case you can do this:

re.findall(r'URL=(.*?)\.', STRINGNAMEHERE)

Let me go into more detail about (.*?). The dot means any character (except a newline character), and the star means zero or more occurrences. A ? on its own is described in the docs as "Causes the resulting RE to match 0 or 1 repetitions of the preceding RE. ab? will match either ‘a’ or ‘ab’." Placed directly after the star, though, it makes the star non-greedy, so it matches as few characters as possible. The parentheses place it all into a group. The final dot is escaped as \. so that it matches a literal period rather than any character. All this together basically means it will find everything in between URL= and the first period.
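As a small sketch of why the trailing dot needs escaping (the variable names here are only for illustration, and the sample line is the one from the question): with an unescaped dot the lazy group can match nothing at all, which may be one reason for an empty-looking result.

```python
import re

line = 'URL=http://example.net'

# Unescaped final dot: the lazy group matches nothing and '.' consumes the 'h'.
unescaped = re.findall(r'URL=(.*?).', line)
print(unescaped)  # → ['']

# Escaped final dot: the lazy group expands until the first literal '.'.
escaped = re.findall(r'URL=(.*?)\.', line)
print(escaped)  # → ['http://example']
```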

answered Sep 14 '19 at 16:12

gerard Garvey


Hm. Seems to be pulling an empty list... I would print this instead of, for example, using group() like when using re.search(), yes? – passwordhash Sep 14 '19 at 16:16
Do you have newline characters in your string? – gerard Garvey Sep 14 '19 at 16:20

score 0 · Accepted Answer · edited Sep 14 '19 at 16:51

import re

text = """A key feature of effective analytics infrastructure in healthcare is a metadata-driven architecture. In this article, three best practice scenarios are discussed: https://www.healthcatalyst.com/clinical-applications-of-machine-learning-in-healthcare Automating ETL processes so data analysts have more time to listen and help end users , https://www.google.com/, https://www.facebook.com/, https://twitter.com"""

# The code below catches all URLs in text and returns them in a list.
# A raw string keeps the backslashes in the pattern from being treated as string escapes.
urls = re.findall(r'(?:(?:https?|ftp)://)?[\w/\-?=%.]+\.[\w/\-?=%.]+', text)

print(urls)

output:

[ 
   'https://www.healthcatalyst.com/clinical-applications-of-machine-learning-in-healthcare',
   'https://www.google.com/',
   'https://www.facebook.com/',
   'https://twitter.com'
]

Marius Mucenicu · Answer 3 · 2019-09-14T17:55:21.187

You don't need regexes (the re module) for such a simple task.

If the string you have is of the form: 'URL=http://example.net'

Then you can solve this using basic Python in numerous ways, one of them being:


file_line = 'URL=http://example.net'
start_position = file_line.find('=') + 1  # this gives you the first position after =
end_position = file_line.find('.')

# this extracts from the start_position up to but not including end_position
url = file_line[start_position:end_position]

Of course, this is just going to extract one URL. Assuming that you're working with a large text where you'd want to extract all URLs, you'll want to put this logic into a function so that you can reuse it and build around it (iterate with a while or for loop and, depending on how you're iterating, keep track of the position of the last extracted URL, and so on).
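A minimal sketch of that idea, assuming each line of the file holds at most one 'URL=...' entry. The function names extract_url_prefix and extract_all, and the sample lines, are hypothetical and not part of the original answer.

```python
def extract_url_prefix(file_line):
    """Return the text between the first '=' and the first '.' after it, or None."""
    start = file_line.find('=')
    if start == -1:
        return None  # no '=' on this line
    start += 1
    end = file_line.find('.', start)
    if end == -1:
        return None  # no '.' after the '='
    return file_line[start:end]


def extract_all(lines):
    """Apply extract_url_prefix to every line and keep only the hits."""
    results = []
    for line in lines:
        value = extract_url_prefix(line.strip())
        if value is not None:
            results.append(value)
    return results


sample_lines = ['URL=http://example.net\n', 'no url here\n', 'URL=http://example.org\n']
print(extract_all(sample_lines))  # → ['http://example', 'http://example']
```

Because extract_all accepts any iterable of strings, an open file object could be passed in directly, for example inside a with open(...) block, so the file is read one line at a time.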

Word of advice

This question has been answered many times on this forum, by very skilled people, in numerous ways, for instance: here, here, here and here, at a level of detail that would amaze you. And these are not all of them; I just picked the first few that popped up in my search results.

Given that (at the time of posting this question) you're a new contributor to this site, my friendly advice would be to invest some effort into finding such answers. It's a crucial skill that you can't do without in the world of programming.

Remember that whatever problem you are encountering, there is a very high chance that somebody on this forum has already encountered it and received an answer; you just need to find it.

score 0 · Answer 4 · answered Sep 14 '19 at 18:40

Please try this. It worked for me.

import re
s='url=http://example.net'
print(re.findall(r"=(.*)\.",s)[0])
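One thing worth knowing about this pattern: (.*) is greedy, so it runs to the last dot in the string, which only matters once the host has more than one dot. The sketch below uses a made-up sample string with a www prefix (an assumption, not the asker's data) to compare it with the lazy form, and shows re.search with a guard for lines that don't match.

```python
import re

s = 'url=http://www.example.net'

greedy = re.search(r'=(.*)\.', s)   # greedy: runs to the last dot
lazy = re.search(r'=(.*?)\.', s)    # lazy: stops at the first dot

print(greedy.group(1))  # → http://www.example
print(lazy.group(1))    # → http://www

# re.search returns None when nothing matches, so check before calling group().
missing = re.search(r'=(.*)\.', 'no url here')
print(missing.group(1) if missing else None)  # → None
```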

answered Sep 14 '19 at 18:40

Abhishek

