
I need to use Python to match URLs in my text file. However, there is a special case:

i like pic.twitter.com/Sex8JaP5w5/a7htvq

In this case, I would like to keep the emoji next to the URL and match only the URL in the middle.

Ideally, I would like to get a result like this:

i like <url>

Since I am new to this, this is what I have so far:

pattern = re.compile("([:///a-zA-Z////\.])+(.com)+([:///a-zA-Z////\.])")

but the result is unsatisfactory, like this:

i like <url>Sex8JaP5w5/a7htvq

Would you please help me with this? Thank you so much

liaoming999

2 Answers


It looks like you are missing a * or + on the last matching group, so it only matches one character. So you want "([:///a-zA-Z////\.])+(.com)+([:///a-zA-Z////\.])*" or "([:///a-zA-Z////\.])+(.com)+([:///a-zA-Z////\.])+".
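As a rough sketch of the difference, the snippet below runs the question's pattern, the same pattern with * added, and a third variant against the sample line. The third pattern (with_digits) is an assumption added here for illustration, not part of the question or this answer: the character class in the question contains no digits, so even with the quantifier the match stops at the first digit of the path.

```python
import re

text = "i like pic.twitter.com/Sex8JaP5w5/a7htvq"

# Pattern from the question: the last group has no quantifier,
# so it consumes exactly one character after ".com".
original = re.compile(r"([:///a-zA-Z////\.])+(.com)+([:///a-zA-Z////\.])")

# Same pattern with * on the last group, as suggested above.
with_star = re.compile(r"([:///a-zA-Z////\.])+(.com)+([:///a-zA-Z////\.])*")

# Assumption for illustration: digits added to the class, dot escaped,
# and the redundant groups and repeated slashes dropped.
with_digits = re.compile(r"[:/a-zA-Z0-9.]+\.com[:/a-zA-Z0-9.]*")

print(original.sub("<url>", text))     # i like <url>Sex8JaP5w5/a7htvq
print(with_star.sub("<url>", text))    # i like <url>8JaP5w5/a7htvq
print(with_digits.sub("<url>", text))  # i like <url>
```

Because emoji are not in the character class, they would be left untouched on either side of the match.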

Now, I don't know if this regex is simplified for your case, but it does not match all URLs. For an example of that, check out https://www.regextester.com/20

If you are attempting to match any URL, I would recommend rethinking your problem and narrowing it down to more specific types of URLs, like the example you provided.

EDIT: Also, why (.com)+? Is there really a case where multiple ".com"s appear, like .com.com.com?

Also, I think you have a small typo and it is supposed to be (\.com). But since you have ([:///a-zA-Z////\.])+ it could be reduced to (com); however, I think the explicit (\.com) makes the expression easier to read.
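To illustrate why the escape matters, here is a minimal sketch; the test string "welcome" is just an illustrative assumption. An unescaped dot matches any character, so .com can match in text that contains no literal dot at all.

```python
import re

# Unescaped dot: "." matches any character, so ".com" matches "lcom" in "welcome".
loose = re.search(r".com", "welcome")

# Escaped dot: only a literal "." followed by "com" matches.
strict_miss = re.search(r"\.com", "welcome")
strict_hit = re.search(r"\.com", "pic.twitter.com")

print(loose.group())       # lcom
print(strict_miss)         # None
print(strict_hit.group())  # .com
```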

Garrigan Stafford

A solution using existing packages:

from urlextract import URLExtract
import emoji

def remove_emoji(text):
    # emoji.get_emoji_regexp() was removed in emoji 2.0; replace_emoji does the same job
    return emoji.replace_emoji(text, replace='')

extractor = URLExtract()
source = "i like pic.twitter.com/Sex8JaP5w5/a7htvq "
urlsWithEmojis = extractor.find_urls(source)
urls = list(map(remove_emoji, urlsWithEmojis))
print(urls)

Output:

['pic.twitter.com/Sex8JaP5w5/a7htvq']


Inspired by "How do you extract a url from a string using python?" and "removing emojis from a string in Python".

aloisdg