Dataframe slicing from cell after reading csv

Question

I am reading Twitter analytics data from a CSV file into a DataFrame.

I want to extract the URL from a certain cell.

The output of this process is the following:

tweet number tweet id               tweet link              tweet text
1            1.0086341313026E+018   "tweet link goes here"  tweet text goes here https://example.com"

How can I slice this "tweet text" to get its URL? I cannot use a fixed slice such as [-12:-1] because the tweets have different numbers of characters.

jezrael · Accepted Answer · 2018-06-03T14:16:59.247

3

I believe that you want:

print (df['tweet text'].str[-12:-1])
0    example.com
Name: tweet text, dtype: object

A more general solution uses a regex with str.findall to get a list of all links in each cell; if necessary, select the first one by indexing with str[0]:

pat = r'(?:http|ftp|https)://(?:[\w_-]+(?:(?:\.[\w_-]+)+))(?:[\w.,@?^=%&:/~+#-]*[\w@?^=%&/~+#-])?'

print (df['tweet text'].str.findall(pat).str[0])
0    https://example.com
Name: tweet text, dtype: object
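
For anyone who wants to try the regex approach without the original CSV, here is a minimal self-contained sketch. The sample rows are made up for illustration (only the first mirrors the question), and the str.extract variant is an addition of mine, not part of the answer above. Because the pattern uses only non-capturing groups, it has to be wrapped in one capturing group before str.extract will accept it.

```python
import pandas as pd

# Same pattern as in the answer above
PAT = r'(?:http|ftp|https)://(?:[\w_-]+(?:(?:\.[\w_-]+)+))(?:[\w.,@?^=%&:/~+#-]*[\w@?^=%&/~+#-])?'

# Hypothetical sample data; only the first row mirrors the question
df = pd.DataFrame({'tweet text': [
    'tweet text goes here https://example.com"',
    'two links http://a.example.org/x?y=1 and https://b.example.org',
    'no link in this tweet',
]})

# List of every link found in each cell (empty list when there is none)
all_links = df['tweet text'].str.findall(PAT)

# First link only; rows without a link become NaN.
# str.extract needs a capture group, so the whole pattern is wrapped in one.
first_link = df['tweet text'].str.extract('(' + PAT + ')', expand=False)

print(all_links)
print(first_link)
```

For rows that contain a link, str.findall(PAT).str[0] and the str.extract call give the same first match; rows without a link come out as NaN in both cases.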

edited Jun 03 '18 at 14:16

answered Jun 03 '18 at 13:44

jezrael


This is exactly what I needed. Thank you. – tarek hassan Jun 03 '18 at 14:09

score 3 · Answer 2 · answered Jun 03 '18 at 13:54

Here's one way which uses a list of strings and pd.Series.apply for finding a valid URL:

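import pandas as pd

s = pd.Series(['tweet text goes here https://example.com',
               'some http://other.com example',
               'www.thirdexample.com is here'])

# Substrings that mark a whitespace-separated token as a URL
test_strings = ['http', 'www']

def url_finder(x):
    # Return the first matching token, or None if the text contains no URL
    return next((i for i in x.split() if any(t in i for t in test_strings)), None)

res = s.apply(url_finder)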

print(res)

0     https://example.com
1        http://other.com
2    www.thirdexample.com
dtype: object

sjw · Answer 3 · 2018-06-03T14:12:18.093

2

Here's an alternative that will work if the domain name length is variable, rather than always 11 characters long:

In [2]: df['tweet text'].str.split('//').str[-1]

Out[2]:
1    example.com
Name: tweet text, dtype: object
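
One caveat with splitting on '//': it keeps everything after the last '//', so any words following the link come along, and a cell with no link is returned unchanged. A rough sketch of that behaviour, together with a possible regex alternative that stops at the next slash, whitespace or quote, is below. The sample strings and the pattern are my own assumptions, not taken from the question's data.

```python
import pandas as pd

# Hypothetical sample data
s = pd.Series([
    'tweet text goes here https://example.com',
    'link first https://example.org/path then more words',
    'no link in this tweet',
])

# Split approach: everything after the last '//'
after_slashes = s.str.split('//').str[-1]

# Alternative: capture only the host part that follows '//'
# (stops at the next '/', whitespace or double quote; NaN when no '//' is present)
domain = s.str.extract(r'//([^/\s"]+)', expand=False)

print(after_slashes)
print(domain)
```

If every tweet ends with its link, as in the question, the two give the same result; the extract version is just a bit more defensive.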

edited Jun 03 '18 at 14:12

answered Jun 03 '18 at 13:52

sjw



Better is `df['tweet text'].str.split('//').str[-1]` – jezrael Jun 03 '18 at 14:09
Thanks, thought there must have been a better way than apply but couldn't find it, will edit. – sjw Jun 03 '18 at 14:11
