Extracting a specific query parameter from an href attribute using Beautiful Soup

Question

I have an anchor tag as follows:

<a class="gsc_a_at" href="/citations?view_op=view_citation&amp;hl=en&amp;user=11JgipcAAAAJ&amp;pagesize=100&amp;citation_for_view=11JgipcAAAAJ:j3f4tGmQtD8C">

I want to extract the value that follows citation_for_view using Beautiful Soup. How can I do this without regular expressions?

Below is what I tried.

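from bs4 import BeautifulSoup

# The anchor tag shown above
input_data = '''<a class="gsc_a_at" href="/citations?view_op=view_citation&amp;hl=en&amp;user=11JgipcAAAAJ&amp;pagesize=100&amp;citation_for_view=11JgipcAAAAJ:j3f4tGmQtD8C">'''

# Name the parser explicitly so the result does not depend on what is installed
soup = BeautifulSoup(input_data, 'html.parser')

# Print the href of every anchor tag that has one
for href_tag in soup.find_all('a', href=True):
    print(href_tag['href'])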

This outputs:

/citations?view_op=view_citation&hl=en&user=11JgipcAAAAJ&pagesize=100&citation_for_view=11JgipcAAAAJ:j3f4tGmQtD8C

How can I extract the value of citation_for_view from the href and output just 11JgipcAAAAJ:j3f4tGmQtD8C?

score 2 · Accepted Answer · answered Oct 02 '15 at 17:42

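You can use the urlparse module (urllib.parse in Python 3). Split off the query string first, then parse it with parse_qs, which returns a dict mapping each parameter name to a list of values:

>>> import urlparse

>>> url = '/citations?view_op=view_citation&hl=en&user=11JgipcAAAAJ&pagesize=100&citation_for_view=11JgipcAAAAJ:j3f4tGmQtD8C'
>>> # Parse only the part after '?', so the path does not leak into the first key
>>> vals = urlparse.parse_qs(urlparse.urlparse(url).query)
>>> print vals.get('citation_for_view')
['11JgipcAAAAJ:j3f4tGmQtD8C']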
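Since the question asks for the bare string rather than a one-element list, here is a minimal sketch of the same idea for Python 3, where the module is named urllib.parse. The helper name query_value is hypothetical, introduced only for this illustration. It uses the standard library alone and assumes the href has already been read from the tag.

```python
from urllib.parse import urlparse, parse_qs


def query_value(href, key):
    """Return the first value of `key` in the query string of `href`, or None.

    `query_value` is a hypothetical helper name used only for this sketch.
    """
    query = urlparse(href).query   # the text after '?', without the path
    values = parse_qs(query)       # dict: parameter name -> list of values
    return values.get(key, [None])[0]


# The href printed by the Beautiful Soup loop in the question
href = ('/citations?view_op=view_citation&hl=en&user=11JgipcAAAAJ'
        '&pagesize=100&citation_for_view=11JgipcAAAAJ:j3f4tGmQtD8C')

citation_id = query_value(href, 'citation_for_view')
print(citation_id)
```

Inside the question's loop, the same helper could be called on each tag's href value. A parameter that is absent yields None instead of raising an error.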
