0

I'm looking to grab a URL that begins with `http://` or `https://` from a text file that also contains other unrelated text, and transfer it to another file/list.

    def test():
        with open('findlink.txt') as infile, open('extractlink.txt', 'w') as outfile:
            for line in infile:
                if "https://" in line:
                    outfile.write(line[line.find("https://"): line.find("")])
            print("Done")

The code currently does nothing.

Edit: I see this is being negatively voted like usual, is there anything I can add here?

This is not a duplicate, please re-read carefully.

Dann
  • 159
  • 6
  • What is expected out of this `outfile.write(line[line.find("https://"): line.find("")])`? – mad_ Feb 05 '19 at 21:13
  • It is expected to separate the URL from other unrelated text. Picture a file with contents like this `lorem ipsum https://stackoverflow.com/questions/54543095/search-and-extract-a-url-from-a-text-file dolor sit amet` There may or may not be text written after the URL so `line.find(" ")` would not be useful here. – Dann Feb 05 '19 at 21:15
  • The second part of your slice `line.find("")` this returns `0` that will completely mess up the slice. use [re](https://docs.python.org/3/library/re.html) – Jab Feb 05 '19 at 21:16
  • Yes @Jaba, I'm looking for the proper solution to fix that. Leaving that out won't return only the URL like needed. – Dann Feb 05 '19 at 21:17
  • Why don't you replace it with `line.find(' ')`? Or replace the entire line with `outfile.write(line.split()[0])`? – Jordan Singer Feb 05 '19 at 21:18
  • Possible duplicate of [Extracting a URL in Python](https://stackoverflow.com/questions/839994/extracting-a-url-in-python) – mad_ Feb 05 '19 at 21:19
  • Nope, it's an entirely different dilemma @mad_ – Dann Feb 05 '19 at 21:19
  • 2
    Possible duplicate of [How do you extract a url from a string using python?](https://stackoverflow.com/questions/9760588/how-do-you-extract-a-url-from-a-string-using-python) – Jab Feb 05 '19 at 21:20
  • It certainly looks like these posts solve your problem, @Dansey – Jordan Singer Feb 05 '19 at 21:21
  • @JordanSinger, I've searched through them, they did not solve my problem. – Dann Feb 05 '19 at 21:24
  • Did you try them? They both seem to be answering a generalized version of your question. – Jordan Singer Feb 05 '19 at 21:25
  • @JordanSinger, I had not tried jaba's link, I will try that now! It looks like that could be the solution. – Dann Feb 05 '19 at 21:27
  • Please also reference @mad_ 's SO link, as it amounts to exactly the same solution. – Jordan Singer Feb 05 '19 at 21:29

3 Answers

2

You can use `re` to extract all the URLs.

In [1]: st = '''https://regex101.com/ ha the hkj adh erht  https://regex202.gov
   ...: h euy ashiu fa https://regex303.com aj feij ajj ai http://regex101.com/'''

In [2]: st
Out[2]: 'https://regex101.com/ ha the hkj adh erht  https://regex202.gov h euy ashiu fa https://regex303.com aj feij ajj ai http://regex101.com/'

In [3]: import re

In [4]: a = re.compile(r"https*://(\w+\.\w{3})/*")
In [5]: for i in a.findall(st):
   ...:     print(i)


regex101.com
regex202.gov
regex303.com
regex101.com

For variable tld and path:

st = '''https://regex101.com/ ha the hkj adh erht  https://regex202.gov h euy ashiu fa https://regex303.com aj feij ajj ai http://regex101.com/ ie fah fah http://regex101.co/ ty ahn fah jaio l http://regex101/yhes.com/'''
a = re.compile(r"https*://([\w/]+\.\w{0,3})/*")
for i in a.findall(st):
    print(i)

regex101.com
regex202.gov
regex303.com
regex101.com
regex101.co
regex101/yhes.com
Osman Mamun
  • 2,864
  • 1
  • 16
  • 22
  • Would this also work for URLs that have a path? All URLs being used here will contain a path and a TLD that may or may not have 3 characters. – Dann Feb 05 '19 at 21:23
  • 1
    for tld, you can use {0,3} to have no characters upto 3 characters. For path, you can include path separator in the group `/*` – Osman Mamun Feb 05 '19 at 21:26
  • Also it would help if you include some examples of url you are extracting. – Osman Mamun Feb 05 '19 at 21:28
1

You need to use `re` like in this answer. Below it is incorporated into your function.

    import re

    def test():
        with open('findlink.txt', 'r') as infile, open('extractlink.txt', 'w') as outfile:
            for line in infile:
                try:
                    url = re.search(r"(?P<url>https?://[^\s]+)", line).group("url")
                    outfile.write(url + "\n")  # newline so consecutive URLs don't run together
                except AttributeError:  # re.search returned None: no URL on this line
                    pass
            print("Done")
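A minimal, file-free sketch of the same idea (the sample lines are invented); checking the match before calling `.group` avoids the `AttributeError` handling entirely:

```python
import re

lines = [
    "lorem ipsum https://stackoverflow.com/q/54543095 dolor sit amet",
    "no url on this line",
    "http://example.com at the start",
]
found = []
for line in lines:
    match = re.search(r"(?P<url>https?://[^\s]+)", line)
    if match:  # re.search returns None when there is no URL
        found.append(match.group("url"))
print(found)
```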
Jab
  • 26,853
  • 21
  • 75
  • 114
  • Thank you. This solves the issue without worrying about specific TLDs or lengths! – Dann Feb 05 '19 at 22:40
  • Just a note so you can edit your solution, this returns an attribute error due to `group` if `re.search` doesn't return a url. – Dann Feb 05 '19 at 23:08
  • @Dansey Edited my answer – Jab Feb 05 '19 at 23:12
  • I know this is weeks later but does .group serve a purpose? could `?P` and `.group("url")` be removed to make a simple re search? @Jab – Dann Feb 23 '19 at 04:35
  • 1
    No, re.search returns either `None` or [`re.MatchObject`](https://docs.python.org/2/library/re.html#re.MatchObject) as per the docs. Read there and see your options. – Jab Feb 23 '19 at 05:34
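To illustrate the point in this comment thread (the sample string is made up): the `?P<url>` name is just a convenience for readability; an unnamed pattern behaves the same with `.group(0)`, which is the whole match.

```python
import re

line = "see https://example.com/page for details"
named = re.search(r"(?P<url>https?://[^\s]+)", line)
plain = re.search(r"https?://[^\s]+", line)
print(named.group("url"))  # the named group
print(plain.group(0))      # group 0 is the entire match, no name needed
```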
-1

Here's why the code currently does nothing:

outfile.write(line[line.find("https://"): line.find("")])

Note that `line.find("")` is looking for the empty string. The empty string is always found at the very beginning of the string, so the call always returns 0. Your string slice therefore ends at index 0 and is empty.

Try changing it to line.find(" ") - you're looking for a space, not an empty string.


However, if the line contains spaces before that point, you're still going to mess up. The simplest-to-read way to do it is probably just using separate variables:

if "https://" in line:
    https_begin = line.find("https://")
    https_end = line.find(" ", https_begin)  # first space after the URL begins
    if https_end == -1:                      # no space after the URL: take the rest of the line
        https_end = len(line)
    outfile.write(line[https_begin:https_end])
Green Cloak Guy
  • 23,793
  • 4
  • 33
  • 53
  • See comments. I'm not looking for a space as infile may or may not contain a space after the url. – Dann Feb 05 '19 at 21:22
  • That's not a problem though. Then the `find(' ')` would return a `-1` and you're good to go. – Jordan Singer Feb 05 '19 at 21:23
  • A possible solution using this would be to add a space to the end of findlink.txt regardless of its contents. – Dann Feb 05 '19 at 21:31