1

I am reading Twitter analytics data from a CSV file into a pandas DataFrame.

I want to extract the URL from a certain cell.

The output of this process is the following:

tweet number tweet id               tweet link              tweet text
1            1.0086341313026E+018   "tweet link goes here"  tweet text goes here https://example.com"

How can I slice the "tweet text" column to get the URL it contains? I cannot use a fixed slice such as [-12:-1] because the tweets differ in length.

jpp
tarek hassan

3 Answers

3

I believe that you want:

print(df['tweet text'].str[-12:-1])
0    example.com
Name: tweet text, dtype: object

A more general solution uses a regex with str.findall to get a list of all links in each tweet; if necessary, select the first one by indexing with str[0]:

pat = r'(?:http|ftp|https)://(?:[\w_-]+(?:(?:\.[\w_-]+)+))(?:[\w.,@?^=%&:/~+#-]*[\w@?^=%&/~+#-])?'

print(df['tweet text'].str.findall(pat).str[0])
0    https://example.com
Name: tweet text, dtype: object
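
To see how findall and str[0] behave on tweets with several links or none, here is a minimal self-contained sketch. The sample rows are made up for illustration; only the column name and the pattern come from the answer above.

```python
import pandas as pd

# Hypothetical sample data; only the column name comes from the question.
df = pd.DataFrame({
    "tweet text": [
        "tweet text goes here https://example.com",
        "two links http://a.example.org/x and https://b.example.org",
        "no link in this tweet",
    ]
})

# Same pattern as in the answer above.
pat = r'(?:http|ftp|https)://(?:[\w_-]+(?:(?:\.[\w_-]+)+))(?:[\w.,@?^=%&:/~+#-]*[\w@?^=%&/~+#-])?'

# One list of matches per row; rows without a link get an empty list.
all_links = df["tweet text"].str.findall(pat)

# First match per row; an empty list becomes NaN.
first_link = all_links.str[0]

print(all_links.tolist())
print(first_link.tolist())
```

Rows without a link end up as NaN in first_link, so they can be dropped with dropna() or filled as needed.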
jezrael
3

Here's one way which uses a list of strings and pd.Series.apply for finding a valid URL:

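import pandas as pd

s = pd.Series(['tweet text goes here https://example.com',
               'some http://other.com example',
               'www.thirdexample.com is here'])

# Substrings that mark a token as a URL
test_strings = ['http', 'www']

def url_finder(x):
    # Return the first whitespace-separated token containing a marker,
    # or None if the text has no such token
    return next((i for i in x.split() if any(t in i for t in test_strings)), None)

res = s.apply(url_finder)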

print(res)

0     https://example.com
1        http://other.com
2    www.thirdexample.com
dtype: object
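
One caveat: a substring test such as 'http' in token also matches words that merely contain "http" or "www". A slightly stricter variant, sketched below, checks token prefixes with str.startswith and returns None when nothing matches. The PREFIXES tuple and the name first_url are assumptions made for this illustration, not part of the answer above.

```python
import pandas as pd

s = pd.Series(['tweet text goes here https://example.com',
               'some http://other.com example',
               'www.thirdexample.com is here',
               'the http protocol has no link here'])

# Assumed set of prefixes; extend it if your data uses other schemes.
PREFIXES = ('http://', 'https://', 'www.')

def first_url(text):
    # First whitespace-separated token that starts with a known prefix,
    # or None when the text has no such token.
    return next((tok for tok in text.split() if tok.startswith(PREFIXES)), None)

res = s.apply(first_url)
print(res)
```

The last row contains the bare word "http", which the prefix check ignores, so that row yields None.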
jpp
2

Here's an alternative that will work if the domain name length is variable, rather than always 11 characters long:

In [2]: df['tweet text'].str.split('//').str[-1]

Out[2]:
1    example.com
Name: tweet text, dtype: object
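
Note that this approach relies on the URL being the last thing in the tweet. The short sketch below, using made-up sample strings, shows what the split returns when text follows the link or when there is no '//' at all.

```python
import pandas as pd

# Hypothetical sample strings covering three cases.
s = pd.Series(['tweet text goes here https://example.com',
               'link first https://example.com then more words',
               'no link at all'])

# Everything after the last '//' in each string.
tail = s.str.split('//').str[-1]
print(tail.tolist())
```

When the link is not at the end, the trailing words are kept, and a tweet without '//' is returned unchanged, so the regex-based approach is safer for mixed data.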
sjw