Extract string between two strings in pandas

Question

I have a text column that looks like:

http://start.blabla.com/landing/fb603?&mkw...

I want to extract "start.blabla.com" which is always between:

http://

and:

/landing/

namely:

start.blabla.com

I do:

df.col.str.extract('http://*?\/landing')

But it doesn't work. What am I doing wrong?

Try `http://([^/]+)/landing` – Wiktor Stribiżew Dec 14 '16 at 11:16
Works! Can you explain your solution to me? @WiktorStribiżew – chopin_is_the_best Dec 14 '16 at 11:17
You are missing the capturing parentheses. – AKS Dec 14 '16 at 11:17

Wiktor Stribiżew · Answer 1 · 2016-12-14T11:23:40.400

9

Your regex matches http:/, then 0+ / symbols as few as possible and then /landing.
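
To see this concretely, here is a minimal sketch using Python's built-in `re` module (pandas' `str` methods use the same regex syntax). The variable names are just for illustration, and the URL is the truncated one from the question.

```python
import re

# The pattern from the question, written as a raw string.
pattern = r"http://*?\/landing"

# The (truncated) URL from the question.
url = "http://start.blabla.com/landing/fb603?&mkw..."

# No match: after "http:/" the pattern only allows more "/" characters
# before "/landing", so the host name can never be consumed.
no_match = re.search(pattern, url)
print(no_match)

# It does match a string where only slashes sit between "http:" and "landing".
slash_only = re.search(pattern, "http://landing")
print(slash_only.group(0))
```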

The extract method requires a regular expression with at least one capture group, so you need to match and capture the characters after http:// that are not /, 1 or more times. It can be done with

http://([^/]+)/landing
       ^^^^^^^

where [^/]+ is a negated character class that matches 1+ occurrences of characters other than /.
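
Applied to a DataFrame, this could look like the sketch below. The frame `df` and its single row are assumptions built from the truncated URL in the question; `expand=False` is used so that a single capture group comes back as a Series rather than a one-column DataFrame.

```python
import pandas as pd

# Hypothetical sample data: the truncated URL from the question.
df = pd.DataFrame({"col": ["http://start.blabla.com/landing/fb603?&mkw..."]})

# One capture group; expand=False returns a Series instead of a DataFrame.
host = df["col"].str.extract(r"http://([^/]+)/landing", expand=False)

print(host.tolist())
```

Rows that do not match the pattern come back as NaN rather than raising an error.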

See the regex demo

edited Dec 14 '16 at 11:23

answered Dec 14 '16 at 11:18

Wiktor Stribiżew


Why not `http:\/\/(.*?)\/landing` ? – Mohammad Yusuf Dec 14 '16 at 12:06
@MohammadYusufGhazi: Because `.*?` causes unnecessary "forward backtracking" (lazy pattern "expansion"), thus requiring more "steps" for the regex engine to return a match compared to a negated character class that just grabs 1+ chars other than `/` (here) and then checks `/landing`. With `.*?` the `/landing` substring is searched for already after matching `http://` - and we know it is not right there. This check repeats after each char in the host value. People often say regex is "overkill" for cases where a simple string method should do. It is the same situation: if `[^/]` will do, use it. – Wiktor Stribiżew Dec 14 '16 at 12:09
Hmm. Interesting. Thanks for the explanation. – Mohammad Yusuf Dec 14 '16 at 12:12
If you mean to ask for more details/examples of how lazy patterns work, see [*Can I improve performance of this regular expression further*](http://stackoverflow.com/a/33869801/3832970). – Wiktor Stribiżew Dec 14 '16 at 12:22
This won't notify the OP. Maybe if you comment below question, it would. – Mohammad Yusuf Dec 27 '16 at 12:41
@MYGz When you try adding 2 user tags in a comment, the popup says: *Only one additional `@user` can be notified; the post owner will always be notified*. Any comment here should notify OP. – Wiktor Stribiżew Dec 27 '16 at 12:44
You have 0 questions or I would have tested it :P – Mohammad Yusuf Dec 27 '16 at 12:46
Did you get a notification for http://stackoverflow.com/questions/41340341/pythonic-way-for-calculating-length-of-lists-in-pandas-dataframe-column? – Wiktor Stribiżew Dec 27 '16 at 12:51
Ok, I added the comment under the question. – Wiktor Stribiżew Dec 27 '16 at 13:17

score 1 · Accepted Answer · answered Dec 14 '16 at 11:28

Just to answer a question you didn't ask, if you wanted to extract several portions of the string into separate columns, you'd do it this way:

df.col.str.extract('http://(?P<Site>.*?)/landing/(?P<RestUrl>.*)')

You'd get something along the lines of:

               Site        RestUrl
0  start.blabla.com  fb603?&mkw...
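
For completeness, a self-contained version of the above that can be run as-is. The one-row frame is an assumption based on the truncated URL in the question; the named groups become the column names of the result.

```python
import pandas as pd

# Hypothetical sample data: the truncated URL from the question.
df = pd.DataFrame({"col": ["http://start.blabla.com/landing/fb603?&mkw..."]})

# Two named groups -> a DataFrame with columns "Site" and "RestUrl".
parts = df["col"].str.extract(r"http://(?P<Site>.*?)/landing/(?P<RestUrl>.*)")

print(parts)
```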

To understand how this regex (and any other regex, for that matter) is constructed, I suggest you take a look at the excellent site regex101. I constructed a snippet where you can see the above regex in action here.
