So I have the following text example:

Good Morning,

The link to your exam is https://uni.edu?hash=89234rw89yfw8fw89ef .Please complete it within the stipulated time.

If you have any issue, please contact us
https://www.uni.edu
https://facebook.com/uniedu

What I want is to extract the URL of the exam link: https://uni.edu?hash=89234rw89yfw8fw89ef. I'm planning to use re.findall(), but I'm having difficulty writing a regex that extracts that specific URL.

import re

def find_exam_url(text_file):
    filename = open(text_file, "r")
    new_file = filename.readlines()
    word_lst = []

    for line in new_file:
        exam_url = re.findall('https?://', line) #use regex to extract exam url
    return exam_url

if __name__ == "__main__":
    print(find_exam_url("mytextfile.txt"))

The output I get is:

['http://']

Instead of:

https://uni.edu?hash=89234rw89yfw8fw89ef

Would appreciate some help on this.

Maxxx
  • Please, check the following thread: https://stackoverflow.com/questions/3809401/what-is-a-good-regular-expression-to-match-a-url – Ilya Jan 14 '21 at 03:27
  • You can use `https://uni\.edu\?\S+` or a broader variant `https?://[^\s?]+\?\S+` https://regex101.com/r/QHWgRk/1 – The fourth bird Jan 14 '21 at 08:41

1 Answer

This regex works:

>>> re.findall(r'(https?://.*?)\s', s)
['https://uni.edu?hash=89234rw89yfw8fw89ef',
 'https://www.uni.edu',
 'https://facebook.com/uniedu']

Here s is the full text of your file (for example, as returned by f.read()), and the pattern (https?://.*?)\s lazily matches everything from the scheme up to the next whitespace character.
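
One caveat with that pattern: it needs a whitespace character after the URL, so a URL that ends the string with no final newline is skipped. Below is a minimal sketch of a variant using \S+ instead. It assumes the sample text from the question is held in a string s with no trailing newline; the variable names are only illustrative.

```python
import re

# Sample text from the question; note there is no trailing newline.
s = (
    "Good Morning,\n\n"
    "The link to your exam is https://uni.edu?hash=89234rw89yfw8fw89ef "
    ".Please complete it within the stipulated time.\n\n"
    "If you have any issue, please contact us\n"
    "https://www.uni.edu\n"
    "https://facebook.com/uniedu"
)

# \S+ consumes every non-whitespace character after the scheme,
# so a URL at the very end of the string is still matched.
urls = re.findall(r"https?://\S+", s)
print(urls)

# The whitespace-terminated pattern misses the last URL in this string.
urls_ws = re.findall(r"(https?://.*?)\s", s)
print(urls_ws)
```

The raw-string prefix (r"...") keeps Python from treating \S and \s as string escapes before the regex engine sees them.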

If you need to extract the url mentioned as an exam link, you can make the regex more specific:

>>> re.findall(r'exam.*(https?://.*?)\s', s)
['https://uni.edu?hash=89234rw89yfw8fw89ef']

Alternatively, the exam link seems to carry an identifier as a URL parameter of the form ?hash=, so a pattern that targets that parameter is probably the better choice:

>>> re.findall(r'(https?://.*\?hash=.*?)\s', s)
['https://uni.edu?hash=89234rw89yfw8fw89ef']
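
If you would rather not encode the parameter name in the regex, a rough alternative is to collect every URL first and then inspect its query string with urllib.parse. This is only a sketch: it assumes the exam link is the one carrying a hash parameter, and find_exam_url_in_file and the param argument are hypothetical names introduced here, not part of your code.

```python
import re
from urllib.parse import parse_qs, urlparse


def find_exam_url(text, param="hash"):
    """Return the first URL in `text` whose query string has `param`, else None."""
    for url in re.findall(r"https?://\S+", text):
        # parse_qs turns "hash=abc" into {"hash": ["abc"]}.
        if param in parse_qs(urlparse(url).query):
            return url
    return None


def find_exam_url_in_file(path):
    # Read the whole file at once so the search covers every line,
    # not just the last one processed in a loop.
    with open(path, encoding="utf-8") as f:
        return find_exam_url(f.read())


if __name__ == "__main__":
    sample = (
        "The link to your exam is https://uni.edu?hash=89234rw89yfw8fw89ef "
        ".Please complete it within the stipulated time.\n"
        "https://www.uni.edu\n"
        "https://facebook.com/uniedu\n"
    )
    print(find_exam_url(sample))
```

Reading the file in one call also avoids the problem in the original loop, where exam_url was overwritten on every line and only the result for the last line was returned.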
Jarvis
  • sorry but i didn't want the second and third url, "https://www.uni.edu", 'https://facebook.com/uniedu' but rather just 'https://uni.edu?hash=89234rw89yfw8fw89ef' – Maxxx Jan 14 '21 at 03:38
  • 1
    yes thank you! sorry i forgot to accept it as an answer to my question. – Maxxx Jan 14 '21 at 14:48