I have an anchor tag as follows:

<a class="gsc_a_at" href="/citations?view_op=view_citation&amp;hl=en&amp;user=11JgipcAAAAJ&amp;pagesize=100&amp;citation_for_view=11JgipcAAAAJ:j3f4tGmQtD8C">

I want to extract the value of the citation_for_view parameter using BeautifulSoup. How can I do it without regular expressions?

Below is what I tried.

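#!/usr/bin/python
from bs4 import BeautifulSoup

# The anchor tag from the page being scraped
input_data = '''<a class="gsc_a_at" href="/citations?view_op=view_citation&amp;hl=en&amp;user=11JgipcAAAAJ&amp;pagesize=100&amp;citation_for_view=11JgipcAAAAJ:j3f4tGmQtD8C">'''

# Name the parser explicitly so BeautifulSoup does not have to guess one
soup = BeautifulSoup(input_data, 'html.parser')

# Print the href of every anchor that has one
for href_tags in soup.find_all('a', href=True):
    print href_tags['href']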

This outputs:

/citations?view_op=view_citation&hl=en&user=11JgipcAAAAJ&pagesize=100&citation_for_view=11JgipcAAAAJ:j3f4tGmQtD8C

How can I extract the value of citation_for_view from the href and output just 11JgipcAAAAJ:j3f4tGmQtD8C?


1 Answer

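You can use the urlparse module. Its parse_qs function turns a query string into a dictionary that maps each parameter name to a list of values, so no regular expression is needed. (In Python 3 the same functions live in urllib.parse.)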

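>>> import urlparse
>>>
>>> url = '/citations?view_op=view_citation&hl=en&user=11JgipcAAAAJ&pagesize=100&citation_for_view=11JgipcAAAAJ:j3f4tGmQtD8C'
>>> query = urlparse.urlparse(url).query  # keep only the part after '?'
>>> vals = urlparse.parse_qs(query)
>>> print vals.get('citation_for_view')
['11JgipcAAAAJ:j3f4tGmQtD8C']
>>> print vals['citation_for_view'][0]  # just the string
11JgipcAAAAJ:j3f4tGmQtD8C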
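
For Python 3, where urlparse was folded into urllib.parse, a minimal sketch of the same idea might look like this. The helper name citation_id and its param argument are made up for illustration, and href stands in for the string that BeautifulSoup returns from href_tags['href'].

```python
from urllib.parse import urlparse, parse_qs


def citation_id(href, param="citation_for_view"):
    """Return the first value of `param` in href's query string, or None."""
    # Split off the query string (everything after '?')
    query = urlparse(href).query
    # parse_qs maps each parameter name to a list of its values
    values = parse_qs(query).get(param)
    return values[0] if values else None


# Hypothetical stand-in for href_tags['href'] from the question
href = ("/citations?view_op=view_citation&hl=en&user=11JgipcAAAAJ"
        "&pagesize=100&citation_for_view=11JgipcAAAAJ:j3f4tGmQtD8C")

print(citation_id(href))
```

Returning None when the parameter is missing is only one possible choice; raising an exception may suit you better if every link is expected to carry the parameter.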