I have a text column that looks like:

http://start.blabla.com/landing/fb603?&mkw...

I want to extract "start.blabla.com" which is always between:

http://

and:

/landing/

namely:

start.blabla.com

I do:

df.col.str.extract('http://*?\/landing')

But it doesn't work. What am I doing wrong?

Julien Marrec
chopin_is_the_best

2 Answers

9

Your regex matches http:/, then zero or more / characters (as few as possible), and then /landing. It never consumes the host name, so it cannot match your URLs.

You need to match and capture the characters after http:// other than /, one or more times. The extract method also requires a regular expression with at least one capture group, which your pattern lacks. It can be done with

http://([^/]+)/landing
       ^^^^^^^

where [^/]+ is a negated character class that matches 1+ occurrences of characters other than /.
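For reference, here is a minimal sketch of how that pattern could be applied with pandas. The DataFrame contents and the second URL are made up for illustration (the first URL is the truncated one from the question, kept as-is), and expand=False is just one choice: it asks for a Series rather than a one-column DataFrame.

```python
import pandas as pd

# Hypothetical sample data. The first URL is the truncated one from the
# question; the second is invented to show what a non-match looks like.
df = pd.DataFrame({"col": [
    "http://start.blabla.com/landing/fb603?&mkw...",
    "http://other.example.org/home",
]})

# The pattern has exactly one capture group, which str.extract requires.
# expand=False returns a Series instead of a one-column DataFrame.
hosts = df["col"].str.extract(r"http://([^/]+)/landing", expand=False)

print(hosts.tolist())
```

Rows that do not match the pattern come back as NaN, so it may be worth checking for missing values afterwards.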

See the regex demo

Wiktor Stribiżew
  • Why not `http:\/\/(.*?)\/landing` ? – Mohammad Yusuf Dec 14 '16 at 12:06
  • @MohammadYusufGhazi: Because `.*?` causes unnecessary "forward backtracking" (lazy pattern "expansion"), thus requiring more "steps" for the regex engine to return a match compared to a negated character class that just grabs 1+ chars other than `/` (here) and then checks `/landing`. With `.*?` the `/landing` substring is searched for already after matching `http://` - and we know it is not right there. This check repeats after each char in the host value. People often say regex is "overkill" for cases where a simple string method should do. It is the same situation: if `[^/]` will do, use it. – Wiktor Stribiżew Dec 14 '16 at 12:09
  • Hmm. Interesting. Thanks for the explanation. – Mohammad Yusuf Dec 14 '16 at 12:12
  • If you mean to ask for more details/examples of how lazy patterns work, see [*Can I improve performance of this regular expression further*](http://stackoverflow.com/a/33869801/3832970). – Wiktor Stribiżew Dec 14 '16 at 12:22
  • This won't notify the OP. Maybe if you comment below question, it would. – Mohammad Yusuf Dec 27 '16 at 12:41
  • @MYGz When you try adding 2 user tags in a comment, the popup says: *Only one additional `@user` can be notified; the post owner will always be notified*. Any comment here should notify OP. – Wiktor Stribiżew Dec 27 '16 at 12:44
  • You have 0 questions or I would have tested it :P – Mohammad Yusuf Dec 27 '16 at 12:46
  • Did you get a notification for http://stackoverflow.com/questions/41340341/pythonic-way-for-calculating-length-of-lists-in-pandas-dataframe-column? – Wiktor Stribiżew Dec 27 '16 at 12:51
  • Ok, I added the comment under the question. – Wiktor Stribiżew Dec 27 '16 at 13:17
1

Just to answer a question you didn't ask, if you wanted to extract several portions of the string into separate columns, you'd do it this way:

df.col.str.extract('http://(?P<Site>.*?)/landing/(?P<RestUrl>.*)')

You'd get something along the lines of:

               Site        RestUrl
0  start.blabla.com  fb603?&mkw...
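A self-contained version of the call above might look like the sketch below. The one-row DataFrame is an assumption built from the truncated URL in the question; the pattern and the group names Site and RestUrl are the ones used in this answer.

```python
import pandas as pd

# Hypothetical one-row frame using the (truncated) URL from the question.
df = pd.DataFrame({"col": ["http://start.blabla.com/landing/fb603?&mkw..."]})

# Each named group (?P<name>...) becomes a column with that name.
parts = df["col"].str.extract(r"http://(?P<Site>.*?)/landing/(?P<RestUrl>.*)")

print(parts)
```

With more than one capture group, str.extract returns a DataFrame, so the result can be joined back onto df if the new columns are needed alongside the original one.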

To understand how this regex (or any other regex, for that matter) is constructed, I suggest you take a look at the excellent site regex101. I constructed a snippet where you can see the above regex in action here.

Julien Marrec