
I am practicing pandas and have run into a problem with an exercise.

I have an Excel file with a column that stores two types of URLs.

df = pd.DataFrame({'id': [None] * 4,  # placeholder: id is empty for now, but must match the length of url
                   'url': ['www.something/12312', 'www.something/12343', 'www.somethingelse/42312', 'www.somethingelse/62343']})

| id | url |
| -------- | -------------- |
|  | 'www.something/12312' |
|  | 'www.something/12343' |
|  | 'www.somethingelse/42312' |
|  | 'www.somethingelse/62343' |

I am supposed to turn these URLs into ids, where only the number from the URL becomes part of the id. The new id column should look like this:

df = pd.DataFrame({'id': ['id_12312', 'id_12343', 'diffid_42312', 'diffid_62343'],
                   'url': ['www.something/12312', 'www.something/12343', 'www.somethingelse/42312', 'www.somethingelse/62343']})

| id | url |
| -------- | -------------- |
| id_12312 | 'www.something/12312' |
| id_12343 | 'www.something/12343' |
| diffid_42312 | 'www.somethingelse/42312' |
| diffid_62343 | 'www.somethingelse/62343' |

My problem is how to extract only the number and put the prefix in front of it that matches the type of URL. I have tried the replace and extract functions in pandas:

id_replaced = df.replace(regex={re.search('something', df['url']): 'id_' + str(re.search(r'\d+', i).group()), re.search('somethingelse', df['url']): 'diffid_' + str(re.search(r'\d+', i).group())})
        
df['id'] = df['url'].str.extract(re.search(r'\d+', df['url']).group())

However, both attempts raise TypeError: expected string or bytes-like object.

Sorry for the tables in codeblock. The page was screaming that I have code that is not properly formatted when it was not in a codeblock.

Paulina

1 Answer


Here is one solution, though I don't quite understand when to use the id prefix and when to use diffid.

>>> df.id = 'id_'+df.url.str.split('/', n=1, expand=True)[1]
>>> df
         id                      url
0  id_12312      www.something/12312
1  id_12343      www.something/12343
2  id_42312  www.somethingelse/42312
3  id_62343  www.somethingelse/62343

Or using str.extract

>>> df.id = 'id_' + df.url.str.extract(r'/(\d+)$', expand=False)
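
For context on the TypeError in the question: `re.search` works on a single string, so passing it a whole Series fails, while the `Series.str` methods apply the pattern row by row. Here is a minimal sketch of the difference. The two-row frame is a shortened version of the question's data, and the `raised` flag is only there for illustration.

```python
import re
import pandas as pd

df = pd.DataFrame({'url': ['www.something/12312', 'www.somethingelse/42312']})

# re.search expects one string; a Series is not a string, so this raises TypeError.
try:
    re.search(r'\d+', df['url'])
    raised = False
except TypeError:
    raised = True

# Series.str.extract applies the pattern to every row instead.
# expand=False returns a Series rather than a one-column DataFrame.
numbers = df['url'].str.extract(r'/(\d+)$', expand=False)
df['id'] = 'id_' + numbers
print(df)
```

Note that `str.extract` takes the pattern itself (a string), not the result of a `re.search` call.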
Danail Petrov
  • Thank you. The prefix is supposed to differ by web page: for the web page somethingelse the prefix is diffid_, and for the web page something the prefix is id_ – Paulina May 27 '21 at 12:30
  • Thanks, I managed to solve it for the prefix too, thanks to your help :) `df['id_num'] = df.url.str.extract(r'/(\d+)$').astype(str)` `df['id_prefix'] = np.where((df['url'].str.contains('somethingelse')), 'diffid_', 'id_')` `df['id'] = df['id_prefix'] + df['id_num']` – Paulina May 27 '21 at 12:38
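
The prefix rule described in the comments can also be written as a lookup table, which may scale better if more web pages are added later. This is only a sketch under assumptions: the `PREFIXES` dict and the named groups `site` and `num` are hypothetical names introduced here, and the pattern assumes every URL has the form `www.<site>/<digits>` as in the question.

```python
import pandas as pd

df = pd.DataFrame({'url': ['www.something/12312', 'www.something/12343',
                           'www.somethingelse/42312', 'www.somethingelse/62343']})

# Hypothetical lookup table: site name -> id prefix
PREFIXES = {'something': 'id_', 'somethingelse': 'diffid_'}

# Named groups become the column names of the extracted DataFrame.
parts = df['url'].str.extract(r'^www\.(?P<site>[^/]+)/(?P<num>\d+)$')

# Map each site to its prefix, then append the number.
df['id'] = parts['site'].map(PREFIXES) + parts['num']
print(df['id'].tolist())
# ['id_12312', 'id_12343', 'diffid_42312', 'diffid_62343']
```

Because the site name is matched in full, `something` and `somethingelse` cannot be confused with each other. URLs that do not match the pattern, or sites missing from the dict, end up as NaN in `id`.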