0

I'm looking to grab a URL that begins with `http://` or `https://` from a text file that also contains other unrelated text, and transfer it to another file/list.

    def test():
        with open('findlink.txt') as infile, open('extractlink.txt', 'w') as outfile:
            for line in infile:
                if "https://" in line:
                    outfile.write(line[line.find("https://"): line.find("")])
            print("Done")

The code currently does nothing.

Edit: I see this is being negatively voted like usual, is there anything I can add here?

This is not a duplicate, please re-read carefully.

Dann
  • 159
  • 6
  • What is expected out of this `outfile.write(line[line.find("https://"): line.find("")])`? – mad_ Feb 05 '19 at 21:13
  • It is expected to separate the URL from other unrelated text. Picture a file with contents like this `lorem ipsum https://stackoverflow.com/questions/54543095/search-and-extract-a-url-from-a-text-file dolor sit amet` There may or may not be text written after the URL so `line.find(" ")` would not be useful here. – Dann Feb 05 '19 at 21:15
  • The second part of your slice `line.find("")` this returns `0` that will completely mess up the slice. use [re](https://docs.python.org/3/library/re.html) – Jab Feb 05 '19 at 21:16
  • Yes @Jaba, I'm looking for the proper solution to fix that. Leaving that out won't return only the URL like needed. – Dann Feb 05 '19 at 21:17
  • Why don't you replace it with `line.find(' ')`? Or replace the entire line with `outfile.write(line.split()[0])`? – Jordan Singer Feb 05 '19 at 21:18
  • Possible duplicate of [Extracting a URL in Python](https://stackoverflow.com/questions/839994/extracting-a-url-in-python) – mad_ Feb 05 '19 at 21:19
  • Nope, it's an entirely different dilemma @mad_ – Dann Feb 05 '19 at 21:19
  • 2
    Possible duplicate of [How do you extract a url from a string using python?](https://stackoverflow.com/questions/9760588/how-do-you-extract-a-url-from-a-string-using-python) – Jab Feb 05 '19 at 21:20
  • It certainly looks like these posts solve your problem, @Dansey – Jordan Singer Feb 05 '19 at 21:21
  • @JordanSinger, I've searched through them, they did not solve my problem. – Dann Feb 05 '19 at 21:24
  • Did you try them? They both seem to be answering a generalized version of your question. – Jordan Singer Feb 05 '19 at 21:25
  • @JordanSinger, I had not tried jaba's link, I will try that now! It looks like that could be the solution. – Dann Feb 05 '19 at 21:27
  • Please also reference @mad_ 's SO link, as it amounts to exactly the same solution. – Jordan Singer Feb 05 '19 at 21:29

3 Answers

2

You can use `re` to extract all the URLs.

In [1]: st = '''https://regex101.com/ ha the hkj adh erht  https://regex202.gov
   ...: h euy ashiu fa https://regex303.com aj feij ajj ai http://regex101.com/'''

In [2]: st
Out[2]: 'https://regex101.com/ ha the hkj adh erht  https://regex202.gov h euy ashiu fa https://regex303.com aj feij ajj ai http://regex101.com/'

In [3]: import re

In [4]: a = re.compile(r"https*://(\w+\.\w{3})/*")
In [5]: for i in a.findall(st):
   ...:     print(i)


regex101.com
regex202.gov
regex303.com
regex101.com

For variable tld and path:

st = '''https://regex101.com/ ha the hkj adh erht  https://regex202.gov h euy ashiu fa https://regex303.com aj feij ajj ai http://regex101.com/ ie fah fah http://regex101.co/ ty ahn fah jaio l http://regex101/yhes.com/'''
a = re.compile(r"https*://([\w/]+\.\w{0,3})/*")
for i in a.findall(st):
    print(i)

regex101.com
regex202.gov
regex303.com
regex101.com
regex101.co
regex101/yhes.com
Osman Mamun
  • 2,864
  • 1
  • 16
  • 22
  • Would this also work for URLs that have a path? All URLs being used here will contain a path and a TLD that may or may not have 3 characters. – Dann Feb 05 '19 at 21:23
  • 1
    for tld, you can use {0,3} to have no characters upto 3 characters. For path, you can include path separator in the group `/*` – Osman Mamun Feb 05 '19 at 21:26
  • Also it would help if you include some examples of url you are extracting. – Osman Mamun Feb 05 '19 at 21:28
1

You need to use `re` like in this answer. Below it is incorporated into your function.

    import re

    def test():
        with open('findlink.txt', 'r') as infile, open('extractlink.txt', 'w') as outfile:
            for line in infile:
                try:
                    url = re.search(r"(?P<url>https?://[^\s]+)", line).group("url")
                    outfile.write(url + "\n")  # newline so consecutive URLs don't run together
                except AttributeError:  # re.search returned None: no URL on this line
                    pass
            print("Done")
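A minimal, file-free sketch of the same idea (the sample lines are invented); checking the match before calling `.group` avoids the `AttributeError` handling entirely:

```python
import re

lines = [
    "lorem ipsum https://stackoverflow.com/q/54543095 dolor sit amet",
    "no url on this line",
    "http://example.com at the start",
]
found = []
for line in lines:
    match = re.search(r"(?P<url>https?://[^\s]+)", line)
    if match:  # re.search returns None when there is no URL
        found.append(match.group("url"))
print(found)
```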
Jab
  • 26,853
  • 21
  • 75
  • 114
  • Thank you. This solves the issue without worrying about specific TLDs or lengths! – Dann Feb 05 '19 at 22:40
  • Just a note so you can edit your solution, this returns an attribute error due to `group` if `re.search` doesn't return a url. – Dann Feb 05 '19 at 23:08
  • @Dansey Edited my answer – Jab Feb 05 '19 at 23:12
  • I know this is weeks later but does .group serve a purpose? could `?P` and `.group("url")` be removed to make a simple re search? @Jab – Dann Feb 23 '19 at 04:35
  • 1
    No, re.search returns either `None` or [`re.MatchObject`](https://docs.python.org/2/library/re.html#re.MatchObject) as per the docs. Read there and see your options. – Jab Feb 23 '19 at 05:34
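To illustrate the point in this comment thread (the sample string is made up): the `?P<url>` name is just a convenience for readability; an unnamed pattern behaves the same with `.group(0)`, which is the whole match.

```python
import re

line = "see https://example.com/page for details"
named = re.search(r"(?P<url>https?://[^\s]+)", line)
plain = re.search(r"https?://[^\s]+", line)
print(named.group("url"))  # the named group
print(plain.group(0))      # group 0 is the entire match, no name needed
```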
-1

Here's why the code currently does nothing:

outfile.write(line[line.find("https://"): line.find("")])

Note that `line.find("")` is looking for the empty string. The empty string is always found at the very beginning of the string, so the call always returns 0. Your string slice therefore ends at index 0 and is empty.

Try changing it to line.find(" ") - you're looking for a space, not an empty string.


However, if the line contains spaces before that point, you're still going to mess up. The simplest-to-read way to do it is probably just using separate variables:

if "https://" in line:
    https_begin = line.find("https://")
    https_end = line.find(" ", https_begin)  # first space after the URL begins
    if https_end == -1:                      # no space after the URL: take the rest of the line
        https_end = len(line)
    outfile.write(line[https_begin:https_end])
Green Cloak Guy
  • 23,793
  • 4
  • 33
  • 53
  • See comments. I'm not looking for a space as infile may or may not contain a space after the url. – Dann Feb 05 '19 at 21:22
  • That's not a problem though. Then the `find(' ')` would return a `-1` and you're good to go. – Jordan Singer Feb 05 '19 at 21:23
  • A possible solution using this would be to add a space to the end of findlink.txt regardless of its contents. – Dann Feb 05 '19 at 21:31