0

So I have a df that has a column full of domains. So for example I have records like this

common_name
www.amazon.com
amazon.com 
subexample.amazon.com
walmart.en
walmart.uk
michigan.edu

I want to use python to extract anything before the last . but before the 1st period if there is one. So the results would look like this.

common_name
amazon
amazon
amazon 
walmart
walmart
michigan

I found some examples of this here but it looks like it was an operator on a string and it was anything before the certain character not between them. The string operator may take awhile to run, so wondering if there is a function using pandas on the whole df by chance?

Alex
  • 211
  • 1
  • 11

3 Answers3

3

This should work:

df['col'] = df['col'].str.rsplit('.', n=1).str[0].str.split('.').str[-1]

Output:

>>> df
           col
0  common_name
1       amazon
2       amazon
3       amazon
4      walmart
5      walmart
6     michigan
  • for these examples tkoutletstore.co.uk tillyandotto.com.au it's returning the middle piece co and com – Alex Dec 22 '21 at 21:17
  • Yeah, it's tricky there. How should we determine which part to take? Maybe drop all the parts that are less than 4 characters long? –  Dec 22 '21 at 21:19
  • yeah that's the part that's got me flustered.. is there a way to iterate the nth element of the . can we take n-1nth element? cause that will always give the correct results no matter if it has a subdomain or main domain. walmart.com com(n) walmart(n-1) www.walmart.com com(n) walmart(n-1) www (n-2) hopefully that logic makes sense. – Alex Dec 22 '21 at 21:21
  • So what are you saying? What do you mean by "nth element of the ."? –  Dec 22 '21 at 21:24
  • so let's say n = the word or character after the last period (.) so in this example (walmart.com) com would be n, and we would want to look before the period and grab that element which is walmart. – Alex Dec 22 '21 at 21:27
  • Okay. So for `tillyandotto.com.au` wouldn't `n` = `au`, so `n - 1` = `com`? –  Dec 22 '21 at 21:32
  • 1
    ah I see what you are saying now.. shoot that's tough. – Alex Dec 22 '21 at 21:34
  • Maybe just split by `.` and keep the longest part? –  Dec 22 '21 at 21:35
  • that is probably a good idea. cause the one below broke on this domain myga.siteip.info it like caused it to go null – Alex Dec 22 '21 at 22:08
  • Try this: `df['col'].str.split('.').explode().groupby(level=0).apply(lambda x: x.iloc[x.str.len().reset_index(drop=True).idxmax()])` –  Dec 22 '21 at 22:13
  • would this be doing the same thing as Sash but with more steps or am I missing something? – Alex Dec 22 '21 at 22:14
  • Yep, you're right. I didn't Sash's good code. –  Dec 22 '21 at 22:16
1

You could use pd.DataFrame.apply along with a lambda function that returns the longest element after splitting (based on comment in richardec's answer):

In [1]: import pandas as pd
In [2]: d = {
   ...:     'domains': [
   ...:         'common_name',
   ...:         'www.amazon.com',
   ...:         'amazon.com',
   ...:         'subexample.amazon.com',
   ...:         'walmart.en',
   ...:         'walmart.uk',
   ...:         'michigan.edu',
   ...:         'tkoutletstore.co.uk',
   ...:         'tillyandotto.com.au',
   ...:     ]
   ...: }
   ...: df = pd.DataFrame(data=d)
   ...: df
Out[2]: 
                 domains
0            common_name
1         www.amazon.com
2             amazon.com
3  subexample.amazon.com
4             walmart.en
5             walmart.uk
6           michigan.edu
7    tkoutletstore.co.uk
8    tillyandotto.com.au
In [3]: df['extracted'] = df['domains'].apply(lambda d: max(d.split('.'), key=len))

In [4]: df
Out[4]: 
                 domains      extracted
0            common_name    common_name
1         www.amazon.com         amazon
2             amazon.com         amazon
3  subexample.amazon.com     subexample
4             walmart.en        walmart
5             walmart.uk        walmart
6           michigan.edu       michigan
7    tkoutletstore.co.uk  tkoutletstore
8    tillyandotto.com.au   tillyandotto
Sash Sinha
  • 18,743
  • 3
  • 23
  • 40
  • although it's not fool proof, this will return most right answers, cause usually the prefix and suffix are pretty small, so this should do what I need it do. Thanks Sash! – Alex Dec 22 '21 at 22:11
1

Pandas won't make things any faster computation-wise. This regex might work for you:

s.str.extract(r'(\w+)(\.\w{2,3})+$')[0]

But a better solution would be: Extract domain from URL in python

tozCSS
  • 5,487
  • 2
  • 34
  • 31
  • this is great! I wish I knew more regex to take advantage of it more, yeah so for the cases above it looked good, I will comment here if I see any other mishaps but I think this should solve the use case. – Alex Dec 22 '21 at 21:37
  • shoot it broke on this domain myga.siteip.info, it just made the whole thing null after applying the regex – Alex Dec 22 '21 at 22:09
  • 1
    changing `\w{2,3}` to `\w{2,4}` should fix it because `info` is 4 letters. – tozCSS Dec 22 '21 at 22:17
  • yeah I think I will probably take the longest in Sash's answer, cause I'm not sure what the suffix will be it could be .com, .net, .info, or .domain etc. Thanks though! very helpful! – Alex Dec 22 '21 at 22:19