I'm trying to search a .txt file and return any items that match my criteria. I would like to get the entire line and place the URLs in a set or list.

What is the best way to search the txt file and return objects?

Here is what I have so far:

# Target file to search
target_file = 'randomtextfile.txt'

# Open the target file in Read mode
target_open = open(target_file, 'r')

# Start loop. Only return possible url links.
for line in target_open:
     if '.' in line:
        print(target_open.readline())

And here is the sample .txt file:

This is a file:

Sample file that contains random urls. The goal of this
is to extract the urls and place them in a list or set
with python. Here is a random link ESPN.com

Links will have multiple extensions but for the most part
will be one or another.
python.org
mywebsite.net
firstname.wtf
creepy.onion

How to find a link in the middle of line youtube.com for example
  • This sounds like a problem that is solvable by regular expressions. Could you please add the .txt file that you're searching the objects in, which kinds of objects you're searching and how you tried to solve this problem so far? Thank you. – Michael Ostrovsky Feb 06 '19 at 02:09
  • Open the file; iterate over the lines; look for a URL pattern in each line; save lines that have URLs in a container; ensure the file is closed. – wwii Feb 06 '19 at 02:48
  • I added details to the post. Any help is appreciated. I think I'm close.... – Shea Onstott Feb 07 '19 at 16:59

1 Answer

Unless you have restrictions that require you to parse the URLs manually rather than with Python's built-in libraries, the re module can help accomplish this.

Using an answer from "Regular expression to find URLs within a string":

# Import the regex module
import re

# Target file to search
target_file = 'randomtextfile.txt'

# Open the target file in read mode; the with block closes it automatically
with open(target_file, 'r') as target_open:
    # Read the text from the file
    text = target_open.read()

# Raw string so the backslashes reach the regex engine unchanged
urls = re.findall(r'(?:(?:https?|ftp):\/\/)?[\w/\-?=%.]+\.[\w/\-?=%.]+', text)

print(urls)

result:

['ESPN.com', 'python.org', 'mywebsite.net', 'firstname.wtf', 'creepy.onion', 'youtube.com']
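
If you also want the entire line and a de-duplicated collection of URLs, as the question asks, something like the following might work. This is only a sketch: the function name collect_urls and the sample list are made up for illustration, and the pattern is the same one used above.

```python
import re

# Same pattern as above, compiled once and written as a raw string.
URL_PATTERN = re.compile(r'(?:(?:https?|ftp):\/\/)?[\w/\-?=%.]+\.[\w/\-?=%.]+')


def collect_urls(lines):
    """Return (matching_lines, urls) for any iterable of text lines."""
    matching_lines = []  # full lines that contain at least one match
    urls = set()         # unique matches across all lines
    for line in lines:
        found = URL_PATTERN.findall(line)
        if found:
            matching_lines.append(line.rstrip('\n'))
            urls.update(found)
    return matching_lines, urls


# A few lines taken from the sample file in the question.
sample = [
    "with python. Here is a random link ESPN.com\n",
    "will be one or another.\n",
    "python.org\n",
    "How to find a link in the middle of line youtube.com for example\n",
]
matching_lines, urls = collect_urls(sample)
print(matching_lines)
print(sorted(urls))
```

Because the function accepts any iterable of lines, an open file object can be passed in directly, for example inside `with open('randomtextfile.txt') as f:`.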

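Unfortunately, checking if '.' in line: also matches ordinary sentence punctuation, such as "urls. The", "python. Here" and "another.". Also, calling target_open.readline() inside the loop does not print the current line; use print(line) instead.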

Python's re module lets you specify the pattern of URL syntax, so URLs are matched and sentence punctuation is not.
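
To see the difference, here is a small comparison of the two checks on a few lines from the sample file. It is an illustrative sketch; the variable names are my own.

```python
import re

URL_PATTERN = re.compile(r'(?:(?:https?|ftp):\/\/)?[\w/\-?=%.]+\.[\w/\-?=%.]+')

lines = [
    "Sample file that contains random urls. The goal of this",
    "will be one or another.",
    "with python. Here is a random link ESPN.com",
    "creepy.onion",
]

# Naive check: true for every line that contains a period anywhere.
naive_hits = [line for line in lines if '.' in line]

# Regex check: true only when a dot sits between two URL-like character runs.
regex_hits = [line for line in lines if URL_PATTERN.search(line)]

print(len(naive_hits), "lines pass the '.' check")
print(len(regex_hits), "lines pass the regex check")
```

Keep in mind that this pattern is still fairly permissive: any dotted token without spaces, such as a decimal number like 3.14 or a filename, would match too. If that matters for your data, you may need a stricter pattern.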

Hope this helps.