Remove URLs row by row from a large set of text in a Python pandas DataFrame

Question

I have loaded data into a pandas DataFrame, as the picture shows. Some rows contain URL links. I want to remove all of the links and replace them with an empty string (nothing, just wiping them out). Row 4 has a URL, and other rows have URLs too. I want to go through all the rows in the status_message column, find any URL, and remove it. I've been looking at "How to remove any URL within a string in Python" but am not sure how to use it on the DataFrame. Row 4 should end up as: vote for labour register now.

jezrael · Answer 1 · 2017-07-30T04:10:10.973

8

You can use str.replace with the case=False parameter for a case-insensitive match:

import pandas as pd

df = pd.DataFrame({'status_message':['a s sd Www.labour.com',
                                     'httP://lab.net dud ff a',
                                     'a ss HTTPS://dd.com ur o']})
print (df)
             status_message
0     a s sd Www.labour.com
1   httP://lab.net dud ff a
2  a ss HTTPS://dd.com ur o

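# regex=True is required in recent pandas versions, where str.replace defaults to literal matching
df['status_message'] = df['status_message'].str.replace(r'http\S+|www\.\S+', '', case=False, regex=True)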
print (df)
  status_message
0        a s sd 
1       dud ff a
2     a ss  ur o
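
The output above still carries stray spaces where the URLs used to be (rows 0 and 2). As a minimal sketch of one way to tidy that up, assuming a pandas version where `regex=True` has to be passed explicitly: the helper name `strip_urls` and the constant `URL_PATTERN` are made up for illustration and are not part of pandas.

```python
import pandas as pd

# Same pattern as in the answer, with the dot after "www" escaped
URL_PATTERN = r'http\S+|www\.\S+'

def strip_urls(series: pd.Series) -> pd.Series:
    """Remove URL-like tokens, then tidy the whitespace they leave behind."""
    cleaned = series.str.replace(URL_PATTERN, '', case=False, regex=True)
    # Collapse runs of whitespace into a single space
    cleaned = cleaned.str.replace(r'\s+', ' ', regex=True)
    # Drop leading and trailing spaces
    return cleaned.str.strip()

df = pd.DataFrame({'status_message': ['a s sd Www.labour.com',
                                      'httP://lab.net dud ff a',
                                      'a ss HTTPS://dd.com ur o']})
df['status_message'] = strip_urls(df['status_message'])
print(df['status_message'].tolist())
# ['a s sd', 'dud ff a', 'a ss ur o']
```

Keeping the pattern in one place makes it easier to extend later, for example if the data also contains bare domains without "http" or "www".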

edited Jul 30 '17 at 04:10

answered Jul 30 '17 at 04:09

1

Yes, very similar; the only difference is `case=False` for case-insensitive matching. – jezrael Jul 30 '17 at 04:11
1

plus one for `case = False` – Bharath M Shetty Jul 30 '17 at 04:12

score 2 · Answer 2 · answered Jul 30 '17 at 04:08

You can use .replace() with regex=True to do that, i.e.:

import pandas as pd

df = pd.DataFrame({'A':['Nice to meet you www.xy.com amazing','Wow https://www.goal.com','Amazing http://Goooooo.com']})
df['A'] = df['A'].replace(r'http\S+', '', regex=True).replace(r'www\S+', '', regex=True)

Output :

                           A
0  Nice to meet you  amazing
1                       Wow 
2                   Amazing
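
One caveat with the chained version: `Series.replace` has no `case` argument, so `http\S+` misses upper-case schemes such as "HTTPS://". A minimal sketch of a single-call variant, assuming the inline `(?i)` flag from Python's `re` syntax is acceptable here; the second row is changed to upper case purely to illustrate the point.

```python
import pandas as pd

df = pd.DataFrame({'A': ['Nice to meet you www.xy.com amazing',
                         'Wow HTTPS://www.goal.com',      # upper-case scheme, illustrative only
                         'Amazing http://Goooooo.com']})

# (?i) at the start of the pattern makes the whole regex case-insensitive;
# both alternatives are handled in one replace call
df['A'] = df['A'].replace(r'(?i)http\S+|www\.\S+', '', regex=True)
print(df['A'].tolist())
# ['Nice to meet you  amazing', 'Wow ', 'Amazing ']
```

The leftover double and trailing spaces are still there; they can be removed afterwards with `str.strip()` or a whitespace-collapsing replace if needed.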

Gayatri · Answer 3 · 2017-08-02T04:29:23.627

0

I think you could do something as simple as:

for index, row in data.iterrows():
    # Note: lower() also lowercases the words that are kept
    desc = row['status_message'].lower().split()
    print(' '.join(word for word in desc if not word.startswith(('www.', 'http'))))

as long as the URLs start with "www." or "http".
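
The loop above only prints the cleaned text. A minimal sketch of the same word-filtering idea that writes the result back into the column and keeps the original casing of the remaining words; the function name `drop_url_words` and the sample rows (including the example.com and example.org addresses) are made up for illustration.

```python
import pandas as pd

def drop_url_words(text: str) -> str:
    """Drop whitespace-separated words that look like URLs; keep the rest as-is."""
    kept = [word for word in text.split()
            if not word.lower().startswith(('www.', 'http'))]
    return ' '.join(kept)

# Hypothetical sample data standing in for the asker's DataFrame
data = pd.DataFrame({'status_message': ['vote for labour register now http://example.com',
                                        'See Www.example.org today']})

# apply() runs the function on every row of the column and returns a new Series
data['status_message'] = data['status_message'].apply(drop_url_words)
print(data['status_message'].tolist())
# ['vote for labour register now', 'See today']
```

Because the text is split and re-joined on whitespace, extra spaces disappear as a side effect. On very large frames the `str.replace` approaches from the other answers are likely to be faster than a Python-level `apply`.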

edited Aug 02 '17 at 04:29

answered Jul 30 '17 at 02:36

score 0 · Answer 4 · answered Jan 03 '19 at 12:48

0

# Removes only the literal "www." prefix; the rest of the URL stays in the text
df.status_message = df.status_message.str.replace("www.", "", regex=False)

answered Jan 03 '19 at 12:48

yasir khatri
