I'm working with some pretty dirty text data that is currently in a CSV, and I'm looking to extract URLs from it for further processing. I have already referenced a few sources here on SO on how to extract URLs more intelligently. I started with a regex similar to the top answer here and am currently working with a regex similar to the answer in this post.
However, neither of these accounts for a URL that has unwanted text directly after it without any spaces. Take for example "https://example.com/exampleurl," or "https://example.com/exampleurl|unwantedtext". The unwanted parts here are "," and "|unwantedtext".
I am looking to edit my regex so that it detects valid URLs more accurately. If a URL isn't in a valid format, I'd like to mark that. I'm currently working within a pandas DataFrame, so when I extract the URLs it looks something like this:
def parseURL(data):
    # Capture the first http(s) URL found in each row of 'free_text'
    URLpattern = r'(https?:\/\/(?:www\.)?[-a-zA-Z0-9@:%._+~#=]{1,256}\.[a-zA-Z0-9()]{1,6}[-a-zA-Z0-9()@:%_+.~#?&/=]*)'
    data['parsed_url'] = data['free_text'].str.extract(URLpattern, expand=False)
    return data
This successfully gets me "https://example.com" out of the string "I like https://example.com a lot", but it fails to retrieve it from "I like https://example.com, a lot".
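
To make the goal concrete, here is a rough sketch of the behaviour I'm after. It is not a regex fix: it grabs a permissive candidate and cleans it up afterwards, using only the standard library. The names clean_url and is_valid_url, the set of characters to cut at, and the trailing punctuation list are all my own assumptions for illustration, and the validity check is deliberately loose.

```python
import re
from urllib.parse import urlparse

# Permissive candidate: the scheme plus every following non-space character.
CANDIDATE = re.compile(r"https?://\S+")
# Characters assumed to mark the start of unwanted text (adjust for your data).
CUT_AT = re.compile(r'[|<>"]')
# Punctuation assumed never to be the intended last character of a URL.
TRAILING = ".,;:!?)]}"


def clean_url(text):
    """Return the first URL-like substring with trailing junk removed, or None."""
    if not isinstance(text, str):
        return None  # e.g. NaN cells in a DataFrame column
    match = CANDIDATE.search(text)
    if match is None:
        return None
    # Keep only the part before the first cut character, then trim punctuation.
    candidate = CUT_AT.split(match.group(0), maxsplit=1)[0]
    return candidate.rstrip(TRAILING)


def is_valid_url(url):
    """Loose check: an http(s) scheme and a host name containing a dot."""
    if not isinstance(url, str):
        return False
    try:
        parts = urlparse(url)
    except ValueError:
        return False
    return parts.scheme in ("http", "https") and "." in parts.netloc


samples = [
    "I like https://example.com a lot",
    "I like https://example.com, a lot",
    "see https://example.com/exampleurl|unwantedtext",
    "no link in this row",
]
results = [(clean_url(s), is_valid_url(clean_url(s))) for s in samples]
```

In the DataFrame this would be applied with something like data['free_text'].map(clean_url), with a second column holding the result of is_valid_url so that badly formed URLs are marked rather than dropped. Ideally the regex itself would handle this, which is what I'm asking about.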