I have a CSV file containing the following URLs:

1;https://www.one.de 
2;https://www.two.de 
3;https://www.three.de
4;https://www.four.de
5;https://www.five.de

Then I load it into a pandas DataFrame df:

import pandas as pd

cols = ['nr', 'url']
df = pd.read_csv("listing.csv", sep=';', encoding="utf8", dtype=str, names=cols)

Then I would like to add another column 'domain_name' holding the domain name for each nr:

from urllib.parse import urlsplit

def takedn(url):
    m = urlsplit(url)
    return m.netloc.split('.')[-2]

df['domain_name'] = takedn(df['url'].all())
print(df.head())

But it assigns the last domain name to every row.

Output:
  nr                   url domain_name
0  1    https://www.one.de        five
1  2    https://www.two.de        five
2  3  https://www.three.de        five
3  4   https://www.four.de        five
4  5   https://www.five.de        five

I am trying this to learn vectorization, but it does not work as I expected. In the first row the domain_name should be one, in the second two, and so on.

orgen

2 Answers

1

To operate on each element of the column, you can use apply():

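from urllib.parse import urlsplit

def takedn(url):
    # netloc is the host part, e.g. 'www.one.de'; [-2] picks the label before the TLD
    m = urlsplit(url)
    return m.netloc.split('.')[-2]

# apply() calls takedn once per URL instead of once for the whole column
df['domain_name'] = df['url'].apply(takedn)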
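Since the question asks about vectorization, here is a minimal sketch of a column-wise alternative using the pandas `.str` accessor. It also shows why the original attempt filled every row with the same value: assigning a single string to a column broadcasts it to all rows. The sample frame is built inline instead of reading listing.csv, the `broadcast` column name is only illustrative, and the regular expression is an assumption that every URL has the form scheme://host.

```python
from urllib.parse import urlsplit

import pandas as pd

# Sample frame mirroring the question's data (built inline, no CSV needed).
df = pd.DataFrame({
    "nr": ["1", "2", "3", "4", "5"],
    "url": [
        "https://www.one.de",
        "https://www.two.de",
        "https://www.three.de",
        "https://www.four.de",
        "https://www.five.de",
    ],
})


def takedn(url):
    return urlsplit(url).netloc.split(".")[-2]


# Assigning one string broadcasts it to every row; this is the pattern
# behind the all-"five" column in the question.
df["broadcast"] = takedn("https://www.five.de")

# Column-wise alternative: extract the host with a regex (assumes
# scheme://host URLs), then take the second-to-last dot-separated label.
host = df["url"].str.extract(r"^[a-zA-Z][a-zA-Z0-9+.-]*://([^/:?#]+)", expand=False)
df["domain_name"] = host.str.split(".").str[-2]

print(df[["nr", "broadcast", "domain_name"]])
```

The `.str` methods work on the whole column at once from the caller's point of view, although for plain Python strings they may not be faster than `apply()`.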
Ynjxsjmh
  • Thanks, perfect answer. Is there a good tutorial or explanation for vectorization? Just for understanding. – orgen Apr 22 '21 at 15:31
  • @orgen I am not sure what you mean by vectorization, but there is a book called [Python for Data Analysis](http://shop.oreilly.com/product/0636920023784.do) that teaches how to use pandas to handle data. – Ynjxsjmh Apr 22 '21 at 15:38
  • Found a link here for vectorizing: https://stackoverflow.com/questions/1422149/what-is-vectorization – orgen Apr 22 '21 at 15:55
  • @orgen That seems unrelated to pandas. If you mainly want to apply functions to pandas objects, there are three main methods: `apply`, `map` and `applymap`. For the differences among them, see https://stackoverflow.com/questions/19798153. – Ynjxsjmh Apr 22 '21 at 16:02
1

The tldextract package provides a ready-made function for this:

import tldextract
df['domain'] = df.url.map(lambda x : tldextract.extract(x).domain)
df
   nr                   url domain_name domain
0   1    https://www.one.de        five    one
1   2    https://www.two.de        five    two
2   3  https://www.three.de        five  three
3   4   https://www.four.de        five   four
4   5   https://www.five.de        five   five
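One reason to prefer a dedicated library is that `split('.')[-2]` only works when the suffix is a single label such as `.de`. tldextract consults the Public Suffix List, so it can handle multi-part suffixes. The sketch below uses only the standard library to show the limitation; the `.co.uk` URL is a made-up example and does not come from the question's data.

```python
from urllib.parse import urlsplit


def takedn(url):
    # Same helper as in the question: second-to-last label of the host.
    return urlsplit(url).netloc.split(".")[-2]


# Single-label suffix: the result is the registered name.
print(takedn("https://www.five.de"))

# Multi-part suffix (hypothetical URL): the result is part of the suffix,
# not the registered name "example".
print(takedn("https://www.example.co.uk"))
```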
BENY