How to Extract Text from Inside " Python

Question

I'm doing some web scraping and need to extract the date from inside an HTML tag that looks something like this:

<div class="DateTime" title="Feb 21, 2018 at 1:27 AM">Feb 21</div>

I need to pull out the value of the title attribute, as this is the full date.

I have tried:

s = '<div class="DateTime" title="Feb 21, 2018 at 1:27 AM">Feb 21</div>'
l = s.split('"')[1::2]
print(l[1])

However, I get the error "TypeError: 'NoneType' object is not callable".

Possible duplicate of [Extracting an attribute value with beautifulsoup](https://stackoverflow.com/questions/2612548/extracting-an-attribute-value-with-beautifulsoup) — Nils Werner, Feb 21 '18 at 09:38
I'm not able to reproduce the error, [see here](http://www.compileonline.com/execute_python_online.php) — ViG, Feb 21 '18 at 09:39

abybaddi009 · Answer 1 · 2018-02-21T09:53:13.017

3

From the official Beautiful Soup documentation:

The tag <b id="boldest"> has an attribute “id” whose value is “boldest”. You can access a tag’s attributes by treating the tag like a dictionary:
tag['id'] 
gives the output: 'boldest'

You can access that dictionary directly as .attrs:

>>> tag.attrs
{u'id': 'boldest'}

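I assume that the variable s is a Beautiful Soup tag rather than a plain string:

s = soup.find('div', class_='DateTime')  # tag obtained with Beautiful Soup; `soup` is assumed to be the parsed page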

You can then access an attribute of the tag like this:

s['attribute']

So in your case:

l = s['title']
print(l)
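
For completeness, here is a minimal end-to-end sketch of the approach above, assuming the bs4 package is installed and that the markup from the question is representative. The variable names (html, soup, tag, full_date) are illustrative only. If s in the question was a Tag rather than a str, that would also be consistent with the TypeError: Beautiful Soup treats an unknown attribute name such as split as a child-tag lookup, which returns None when no such child exists.

```python
from bs4 import BeautifulSoup

# Sample markup copied from the question.
html = '<div class="DateTime" title="Feb 21, 2018 at 1:27 AM">Feb 21</div>'

# Parse with the parser that ships with the standard library.
soup = BeautifulSoup(html, "html.parser")

# Locate the tag by name and CSS class.
tag = soup.find("div", class_="DateTime")

# Dictionary-style access raises KeyError if the attribute is missing.
full_date = tag["title"]

# .get() returns None instead of raising when the attribute is absent.
maybe_date = tag.get("title")

# The visible text inside the tag is the short date.
short_date = tag.get_text()

print(full_date)
print(short_date)
```

Using tag.get("title") may be the safer choice when scraping many pages, since some tags might lack the attribute.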

edited Feb 21 '18 at 09:53

answered Feb 21 '18 at 09:45

abybaddi009


If that is the case, then accept it as an answer for future reference. – abybaddi009 Feb 21 '18 at 09:54

Mahesh Karia · Answer 2 · 2018-02-21T09:49:12.933

1

Instead of split I would suggest using regex as follows:

import re
s = '<div class="DateTime" title="Feb 21, 2018 at 1:27 AM">Feb 21</div>'
print(re.findall(pattern=r'title="(.*?)"', string=s)[0])

Output:

Feb 21, 2018 at 1:27 AM

edited Feb 21 '18 at 09:49

answered Feb 21 '18 at 09:41

Mahesh Karia


2

This doesn't work with `s = '...Feb 21...'`. You can add a `?` to stop at the first `"`, like so: `pattern="title=\"(.*?)\""`. – Thomas Francois Feb 21 '18 at 09:46
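
To illustrate the point of this comment, here is a small sketch comparing a greedy and a non-greedy pattern. The sample string is an assumption: it is the question's markup with a hypothetical id="post-1" attribute added after title, so that a second pair of quotes follows the date.

```python
import re

# Question's markup plus a hypothetical extra attribute after title.
s = '<div class="DateTime" title="Feb 21, 2018 at 1:27 AM" id="post-1">Feb 21</div>'

# Greedy: .* runs to the LAST double quote on the line.
greedy = re.findall(r'title="(.*)"', s)[0]

# Non-greedy: .*? stops at the FIRST double quote after title=".
lazy = re.findall(r'title="(.*?)"', s)[0]

print(greedy)  # Feb 21, 2018 at 1:27 AM" id="post-1
print(lazy)    # Feb 21, 2018 at 1:27 AM
```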

score 0 · Answer 3 · answered Feb 21 '18 at 09:41

0

    x = s.split('"')
    print(x[3])

Try the above.

answered Feb 21 '18 at 09:41

Ajet


1

Your answer works perfectly for the given example. What if another attribute comes in between? – Vikas Periyadath Feb 21 '18 at 09:46
Regexps are costly, so from a performance point of view I assumed it would be the same. – Ajet Feb 21 '18 at 09:55
Your answer is right, and it is better than regex, but only for this example. If the OP tries another string that has more attributes, your answer won't work. In such situations regex is useful because it searches for the attribute and takes its value. – Vikas Periyadath Feb 21 '18 at 09:59
One thing you could try with your code is to build a dict instead of a list, with attribute names as keys and the split values as values; then you can easily take `d['title']`. – Vikas Periyadath Feb 21 '18 at 10:01
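
One way to get the attribute dict suggested in the last comment, sketched here with the standard library's html.parser instead of manual splitting (a swapped-in technique, not what the commenter wrote). The class name FirstTagAttrs is hypothetical, and the sketch assumes the first start tag in the string is the one of interest.

```python
from html.parser import HTMLParser


class FirstTagAttrs(HTMLParser):
    """Collects the attributes of the first start tag it sees."""

    def __init__(self):
        super().__init__()
        self.attrs = None

    def handle_starttag(self, tag, attrs):
        # attrs arrives as a list of (name, value) pairs.
        if self.attrs is None:
            self.attrs = dict(attrs)


s = '<div class="DateTime" title="Feb 21, 2018 at 1:27 AM">Feb 21</div>'

parser = FirstTagAttrs()
parser.feed(s)
parser.close()

d = parser.attrs
print(d["title"])
```

Because the lookup is by attribute name, the order and number of attributes on the tag no longer matter.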

score 0 · Answer 4 · answered Feb 21 '18 at 09:45

0

Try this:

import re
s = '<div class="DateTime" title="Feb 21, 2018 at 1:27 AM">Feb 21</div>'
print(re.findall(r'title="(.*?)"', s)[0])

You will get:

Feb 21, 2018 at 1:27 AM

answered Feb 21 '18 at 09:45

Dixon MD

