
For example, I have six strings as follows:

  • https://twitter.com/test1
  • http://twitter.com/test2
  • https://www.twitter.com/test3?
  • https://www.mobile.twitter.com/test4
  • https://www.twitter.com/test5?lang=en
  • https://www.instagram.com/test1insta

I want to extract the Twitter username from these links. That is, I would like to search each link with a regex to get the username after twitter.com/, and where a link has a ? introducing URL parameters, I would like to get only what comes before it.

For example it would come out like this:

test1 test2 test3 test4 test5

I have used re.search to match the pattern, but I am struggling to get it to extract just the part I want. Here is what I have tried:

username = re.search(r'twitter.com\/(.*)\?', stringsList)

This matches only the strings that have a question mark after the username, which I understand: just test3 and test5.

I thought I would try making the question mark optional by doing this:

username = re.search(r'twitter.com\/(.*)\??', stringsList)

but instead that just returns all of the usernames along with the additional stuff I don't want, e.g.:

test1 test2 test3? test4 test5?lang=en

But I want it to still extract just the username as group 1, even though the ? is optional.

What would my regex look like to do that? Or do I need to split this up, check whether the string has a question mark first, and use two different searches depending on whether it is present?

I have a test bit of code here, and I've been trying to use this to determine the regex I need.

bobble bubble
Liam
  • Maybe match all but `?`? Like `r'twitter.com/([^?]*)'`? – Wiktor Stribiżew Dec 16 '22 at 16:49
  • To avoid matching the first part, you can use a lookaround as follows: `(?<=twitter.com\/)[^?\s]+`. Check here >> https://regex101.com/r/67xcAj/1. – lemon Dec 16 '22 at 16:51
  • @WiktorStribiżew Ah yeah, that seems to work! I was confused why it wasn't working in my regex101 instance, but I think it's because it's searching them all as one string. In my code that works great! Thank you. – Liam Dec 16 '22 at 16:55
  • @lemon That regex101 doesn't work for me. If I put in the string `https://www.twitter.com/test3?` it matches `test3?`, not `test3`. – Liam Dec 16 '22 at 16:56
  • Check my answer below and the corresponding demos. @Liam – lemon Dec 16 '22 at 16:56
  • Is it absolutely necessary to use regex? You could use [`urllib.parse.urlparse`](https://docs.python.org/3/library/urllib.parse.html#urllib.parse.urlparse) to split the URL into its components. – Pranav Hosangadi Dec 16 '22 at 20:58
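
The two suggestions in the comments can be sketched side by side. This is only an illustration under stated assumptions: the helper names `username_by_regex` and `username_by_urlparse` and the `URLS` list are made up for this example, and the regex is a slight variation on the one in the comment (the dot is escaped and `+` is used instead of `*` so that an empty username does not count as a match).

```python
import re
from urllib.parse import urlparse

# The six sample links from the question.
URLS = [
    "https://twitter.com/test1",
    "http://twitter.com/test2",
    "https://www.twitter.com/test3?",
    "https://www.mobile.twitter.com/test4",
    "https://www.twitter.com/test5?lang=en",
    "https://www.instagram.com/test1insta",
]

# Negated character class: capture everything after "twitter.com/" up to,
# but not including, the first "?". Unlike a greedy ".*", "[^?]+" cannot
# run past the question mark, so no optional "\??" is needed.
USERNAME_RE = re.compile(r"twitter\.com/([^?]+)")


def username_by_regex(url):
    """Return the captured username, or None when the URL does not match."""
    match = USERNAME_RE.search(url)
    return match.group(1) if match else None


def username_by_urlparse(url):
    """Return the first path segment of a twitter.com URL, or None."""
    parts = urlparse(url)
    host = parts.hostname or ""
    # Accept twitter.com itself and any subdomain such as www. or mobile.
    if host != "twitter.com" and not host.endswith(".twitter.com"):
        return None
    # The query string is already split off into parts.query by urlparse.
    segments = [segment for segment in parts.path.split("/") if segment]
    return segments[0] if segments else None


for url in URLS:
    print(url, username_by_regex(url), username_by_urlparse(url))
```

Both helpers return `None` for the Instagram link. The `urlparse` version also ignores anything after a second `/` in the path, which the regex version above does not.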

2 Answers


You can use a lookbehind to avoid matching the first part. Then limit the match on the right-hand side to any character other than "?" and whitespace.

(?<=twitter.com\/)[^?\s]+

Your Python code can be simplified by dropping the capture group (username.group(1) becomes username), as follows:

twittercount = 0
NOTtwittercount = 0
for twitterURL in twitterURLs:
    if (twitterURL.twitter_url and 'twitter.com' in twitterURL.twitter_url):
        twittercount += 1
        username = re.search(r'(?<=twitter.com\/)[^?\s]+', twitterURL.twitter_url)
        print("correct twitter link =", twitterURL.twitter_url)
        print("extracted username =", username)
    else:
        NOTtwittercount += 1
        print("incorrect twitter link =", twitterURL.twitter_url)

Regex demo here. Python demo here.
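
As a small sketch of how the result is read back in Python: `re.search` returns a match object (or `None`), so with the lookbehind pattern the matched text is still retrieved with `.group(0)`, whereas a capture-group pattern uses `.group(1)`. The helper names below are hypothetical, and the dot in `twitter\.com` is escaped here, which the pattern above does not do.

```python
import re

# Lookbehind version: the whole match (group 0) is the username.
LOOKBEHIND_RE = re.compile(r"(?<=twitter\.com/)[^?\s]+")

# Capture-group version: the username is group 1.
CAPTURE_RE = re.compile(r"twitter\.com/([^?\s]+)")


def extract_with_lookbehind(url):
    """Return the username via the lookbehind pattern, or None."""
    match = LOOKBEHIND_RE.search(url)
    # search() gives a match object, not a string, so ask for group 0.
    return match.group(0) if match else None


def extract_with_group(url):
    """Return the username via the capture-group pattern, or None."""
    match = CAPTURE_RE.search(url)
    return match.group(1) if match else None


url = "https://www.twitter.com/test5?lang=en"
print(extract_with_lookbehind(url))
print(extract_with_group(url))
```

Both functions give the same string for the sample links, so the choice between them is mostly a matter of taste.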

lemon
  • Realized my regex demo wasn't correctly saved. Now should link to a correct version. – lemon Dec 16 '22 at 17:04
  • I'm sorry, I'm really not following. Are you saying I could remove the `'twitter.com' in twitterURL.twitter_url` check by using the lookaround? The code in your answer and in your Python demo is not simplified at all; it's identical except for the regex. Additionally, in the Python and regex demos, when I enter a test such as `https://www.twitter.com/test5?lang=en` it matches `test5?lang=en`, which is not what I want. I just want it to match the `test5` part. Am I missing something? I don't see the simplification, or that this works at all. – Liam Dec 16 '22 at 17:07
  • You're right, your original regex gives `test3?`: python demo link wasn't updated too. Check now. – lemon Dec 16 '22 at 17:08
  • Oh yeah, I see, I like this. I still can't see your Python working, but putting it in my regex101 works great. Do I need username.match now instead or something, as it's a match object? I suppose it doesn't explicitly need to search for `twitter.com/` if I'm already filtering, like in @Codesidian's answer, but the extra check probably doesn't hurt? Thanks, I'll get this in and mark it as the answer in a minute @lemon – Liam Dec 16 '22 at 17:21
  • Yeah, after googling, I still need to do `username.group(0)` to get the match back as a string, so it doesn't really change anything, unless I'm doing it wrong? – Liam Dec 16 '22 at 17:26
  • To me the linked python fiddle appears to be working correctly (?). How does the behaviour differ from what you expected? – lemon Dec 16 '22 at 18:09
  • For example it prints like this for me? `extracted username = ` and I have to use `username.group(0)` to get the string `test5` ? – Liam Dec 19 '22 at 09:05
  • If you know how to get the match, and you have solved your task, what's the problem with it? – lemon Dec 19 '22 at 10:56
  • Nothing, I was just curious about it because you said changing the regex to what you put meant you could simplify the Python code from `username.group(1)` to just `username`, but actually it's `username.group(0)`. I was merely asking how that is simplified, because it has just changed the regex and had no impact on the Python. I assumed I was doing something wrong. The solution still stands and is great for what I need, so thank you, and I'll mark it as my correct answer! – Liam Dec 19 '22 at 11:27

To be domain agnostic:

(?:https?:\/\/)?(?:[^?\/\s]+[?\/])([a-zA-Z0-9]*)

The username should be in group 1. This is a modified version of the pattern from this answer, which has a couple of other good methods.

I changed the last filter so that it doesn't include special characters. If underscores are valid, you can add them to the last capture group:

(?:https?:\/\/)?(?:[^?\/\s]+[?\/])([a-zA-Z0-9_]*)

or something like this to get everything up to the ?:

(?:https?:\/\/)?(?:[^?\/\s]+[?\/])(.*?)\?
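
A quick way to compare these variants is to run them over a few of the sample links. The sketch below is illustrative only: the names `AGNOSTIC_RE`, `UP_TO_QUESTION_RE`, and `first_segment` are invented for this example, and the redundant `\/` escapes are written as plain `/`, which behaves the same in Python.

```python
import re

# Domain-agnostic pattern with underscores allowed in the capture group.
AGNOSTIC_RE = re.compile(r"(?:https?://)?(?:[^?/\s]+[?/])([a-zA-Z0-9_]*)")

# "Everything up to the ?" variant. It ends in a literal "\?", so it can
# only match URLs that actually contain a question mark.
UP_TO_QUESTION_RE = re.compile(r"(?:https?://)?(?:[^?/\s]+[?/])(.*?)\?")


def first_segment(pattern, url):
    """Return group 1 of the first match of pattern in url, or None."""
    match = pattern.search(url)
    return match.group(1) if match else None


for url in (
    "https://twitter.com/test1",
    "https://www.twitter.com/test5?lang=en",
    "https://www.instagram.com/test1insta",
):
    print(url, first_segment(AGNOSTIC_RE, url), first_segment(UP_TO_QUESTION_RE, url))
```

Because the first pattern does not look at the domain, it also returns `test1insta` for the Instagram link, so a separate check such as `'twitter.com' in url` is still needed. The last variant returns `None` for links without a `?`, which is the same limitation as the first attempt in the question.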
Codesidian
  • Why would I want it to be domain agnostic? Is that because I'm already checking that twitter.com is present in the link before I get to the regex? – Liam Dec 16 '22 at 17:22