
My project (unrelated to this question, just context) is an ML classifier. I'm trying to improve it, and I have found that some of the URLs in the text given to it are broken by spaces. For example:

https:// twitter.com/username/sta tus/ID

After I remove the links that are not broken, I am left with things like www website com. I removed those with the following regular expression in Python:

tweet = re.sub(r'(www|http).*?(org |net |edu |com |be |tt |me |ms )', '', tweet)

I've put a space after each TLD because this runs after the regular strip and text processing (so I'm only working with parts of a URL separated by spaces), and in theory we should only pick up the remainders of a broken link... not something like

http website strangeTLD .... communication
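For context, here is a minimal sketch of how that substitution behaves; the sample tweet is made up:

```python
import re

# Hypothetical sample containing a broken-link remnant
tweet = "check this out www website com very cool"
cleaned = re.sub(r'(www|http).*?(org |net |edu |com |be |tt |me |ms )', '', tweet)
# cleaned == "check this out very cool"
```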

It's not perfect, but it works. However, I thought I might try to preemptively remove URLs from Twitter only, since I know the spaces that break the regular URL strip will always be in the same places, hoping this improves my classifier's accuracy. This would get rid of the string of characters that occurs after a link, specifically pictures, which make up a lot of my data.

Specifically, is there a way to select the entity surrounding/after:

pic.twitter.com/

or, in reference to the example I gave earlier, select the entity after the username broken by the space in status (I'm just guessing at this regex)...

http.*?twitter.com/*?/sta tus/

Thank you in advance! And for the record, I was given this dataset to work with; I am not sure why the URLs are almost all broken by spaces.

Claire
  • I'm lost in your description... can you summarize the bottom line? Give a few example input strings and expected outputs as well – Mustofa Rizwan Apr 22 '18 at 07:49
  • Not very clear... Are the spaces present in the initial dataset, or did they appear after your first processing steps? – sciroccorics Apr 22 '18 at 08:25
  • @RizwanM.Tuman they were present in the initial dataset that I was given; I initially didn't notice and am now trying to come back and accommodate for them. Here's an example of a URL in a tweet that isn't stripped because of a space: https:// twitter.com/pappiness/stat us/919752795280027648 – Claire Apr 22 '18 at 10:50
  • Did you check my solution? Wasn't it clear or off by something? – Francesco B. Apr 22 '18 at 12:47
  • @FrancescoB. I have not tried it yet! I am meeting soon to try and get the original dataset... see if they can look at how they retrieved it and eliminate the space problem altogether. If not, I will proceed with the solution and let you know how it goes! – Claire Apr 23 '18 at 07:52
  • great; anyway the solution below works with blanks as well, insert them in the regex and in `currentText` as needed – Francesco B. Apr 24 '18 at 06:37

1 Answer


Yes, what you are talking about is called a positive lookbehind and is written (?<=...), where the ellipsis should be replaced by the text you want to skip.

E.g. if you want to select whatever comes after username in https://twitter.com/username/status/ID, just use

(?<=https:\/\/twitter\.com\/username\/).*

and you will get status/ID, like you can see with this live demo.
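The same check in Python (the status ID here is the one from the comment thread above; slashes don't strictly need escaping in Python's re):

```python
import re

text = "https://twitter.com/username/status/919752795280027648"
m = re.search(r'(?<=https://twitter\.com/username/).*', text)
print(m.group())  # status/919752795280027648
```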

In this case I escaped the slashes / with backslashes, which some regex flavors (e.g. JavaScript regex literals) require, though Python does not; I also used the Kleene star operator, i.e. the asterisk, to match any number of occurrences of . (any character), just like you did.

What a positive lookbehind does is specify some mandatory text before the current position of your cursor; in other words, it places the cursor after the expression you feed it (if that text exists).

Of course this is not enough in your case, since username won't be a fixed string but a variable one, and lookbehinds do not support variable lengths. So you can just skip https://twitter.com/

(?<=https:\/\/twitter\.com\/).*

And then, via Python, create a substring

currentText = "username/status/ID"
result = currentText.split("/",1)[1] # returns status/ID

Test it in this demo (click "Execute"); a simple explanation of how this works is in the answer to this question (in short, you just split the string at the first slash character).
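Putting the two steps together, here is a sketch on a hypothetical tweet; it uses \S* instead of .* so the match stops at the first space, and the URL is the one quoted in the comments above:

```python
import re

# Hypothetical tweet containing an intact Twitter status URL
tweet = "look at this https://twitter.com/pappiness/status/919752795280027648 wow"
m = re.search(r'(?<=https://twitter\.com/)\S*', tweet)
if m:
    remainder = m.group().split("/", 1)[1]
    # remainder == "status/919752795280027648"
```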

As a sidenote, blanks/spaces aren't allowed in URLs and if necessary are usually encoded as %20 or + (see e.g. this answer). In other words, every URL you got can be safely stripped of spaces before processing, so... why didn't they do it?
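For instance, the standard library shows both encodings mentioned above:

```python
from urllib.parse import quote, quote_plus

print(quote("status update"))       # status%20update
print(quote_plus("status update"))  # status+update
```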

Francesco B.