I need to extract URLs from a column of a DataFrame that was created from the following values:

creation_date,tweet_id,tweet_text
2020-06-06 03:01:37,1269102116364324865,#Webinar: Sign up for @SumoLogic's June 16 webinar to learn how to navigate your #Kubernetes environment and unders… https://stackoverflow.com/questions/42237666/extracting-information-from-pandas-dataframe
2020-06-06 01:29:38,1269078966985461767,"In this #webinar replay, @DisneyStreaming's @rothgar chats with @SumoLogic's @BenoitNewton about how #Kubernetes is… https://stackoverflow.com/questions/46928636/pandas-split-list-into-columns-with-regex

The tweet_text column contains the URLs. I am trying the following code:

df["tweet_text"]=df["tweet_text"].astype(str)
pattern = r'https?:\/\/(www\.)?[-a-zA-Z0-9@:%._\+~#=]{1,256}\.[a-zA-Z0-9()]{1,6}\b([-a-zA-Z0-9()@:%_\+.~#?&//=]*)'

df['links'] = ''
df['links']= df["tweet_text"].str.extract(pattern, expand=True)

print(df)

I am using the regex from an answer to this question, and it matches the URL in both rows. But I am getting NaN as the values of the new column df['links']. I have also tried the solution provided in the first answer to this question, which was

df['links']= df["tweet_text"].str.extract(pattern, expand=False).str.strip()

But I am getting the following error:

AttributeError: 'DataFrame' object has no attribute 'str'

Lastly, I created an empty column using df['links'] = '' because I was getting a "ValueError: Wrong number of items passed 2, placement implies 1" error, in case that's relevant. Can someone help me out here?

Raza Ul Haq
  • Your URL pattern is not quite clean, but the main problem is that it contains *capturing* groups where you need *non-capturing* ones. You need to wrap it with a capturing group, `pattern = r'(https?:\/\/(?:www\.)?[-a-zA-Z0-9@:%._+~#=]{1,256}\.[a-zA-Z0-9()]{1,6}[-a-zA-Z0-9()@:%_+.~#?&/=]*)'` – Wiktor Stribiżew Jun 06 '20 at 09:56
  • It worked, thank you. Can you move this comment to an answer so I can accept it? – Raza Ul Haq Jun 06 '20 at 09:58

1 Answer

The main problem is that your URL pattern contains capturing groups where you need non-capturing ones. You need to replace all ( with (?: in the pattern.

However, that alone is not enough, since str.extract requires at least one capturing group in the pattern in order to return any value at all. Thus, you need to wrap the whole pattern with a capturing group.
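
As a rough illustration of why the groups matter, the sketch below runs the question's original pattern through Python's standard re module on a shortened sample string (the sample text is made up for this example). str.extract builds one output column per capturing group, so the two groups seen here would become two columns, which is consistent with the "Wrong number of items passed 2" error.

```python
import re

# Original pattern from the question: note the two capturing groups.
original = r'https?:\/\/(www\.)?[-a-zA-Z0-9@:%._\+~#=]{1,256}\.[a-zA-Z0-9()]{1,6}\b([-a-zA-Z0-9()@:%_\+.~#?&//=]*)'

# Hypothetical sample text, shortened from the tweets in the question.
text = "Sign up for the webinar https://stackoverflow.com/questions/42237666/extracting-information-from-pandas-dataframe"

m = re.search(original, text)
full_match = m.group(0)  # the whole URL
groups = m.groups()      # one entry per capturing group
print(full_match)
print(groups)
```

The first group, (www\.)?, takes no part in the match, so it comes back as None, and the second group holds only the path. The whole URL is available only as the full match, which str.extract does not return.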

You may use

pattern = r'(https?:\/\/(?:www\.)?[-a-zA-Z0-9@:%._+~#=]{1,256}\.[a-zA-Z0-9()]{1,6}[-a-zA-Z0-9()@:%_+.~#?&/=]*)' 

Note that + does not need to be escaped inside a character class. Also, there is no need to use // inside a character class; a single / is enough.
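
A minimal sketch of how the single-group pattern might be applied is shown below. The two-row DataFrame is a made-up stand-in for the data in the question, and it assumes pandas is installed. With exactly one capturing group, expand=False returns a Series, so it can be assigned to a single column.

```python
import pandas as pd

# Pattern from this answer: one capturing group around the whole URL.
pattern = r'(https?:\/\/(?:www\.)?[-a-zA-Z0-9@:%._+~#=]{1,256}\.[a-zA-Z0-9()]{1,6}[-a-zA-Z0-9()@:%_+.~#?&/=]*)'

# Hypothetical sample data standing in for the tweets in the question.
df = pd.DataFrame({
    "tweet_text": [
        "Sign up for the webinar https://stackoverflow.com/questions/42237666/extracting-information-from-pandas-dataframe",
        "A tweet without any link",
    ]
})

# One capturing group plus expand=False gives a Series, not a DataFrame.
links = df["tweet_text"].str.extract(pattern, expand=False)
df["links"] = links
print(df["links"].tolist())
```

Rows without a URL get NaN in the new column. No empty links column needs to be created beforehand.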

Wiktor Stribiżew