
I need to extract the data within double quotes from a string.

Input:

<a href="Networking-denial-of-service.aspx">Next Page →</a>

Output:

Networking-denial-of-service.aspx

Currently, I am using the following method, and it works fine.

atag = '<a href="Networking-denial-of-service.aspx">Next Page →</a>'
start = 0
end = 0

for i in range(len(atag)):
    if atag[i] == '"' and start==0:
        start = i
    elif atag[i] == '"' and end==0: 
        end = i

nxtlink = atag[start+1:end]

So my question is: is there a more efficient way to do this?

Thank you.

dazzieta
  • There are [regular expressions](https://docs.python.org/2/howto/regex.html) of course, but it's [strongly discouraged](http://stackoverflow.com/questions/590747/using-regular-expressions-to-parse-html-why-not) to use them for HTML because you can miss a lot of cases. The "right" way to do it would be to use [HTMLParser](https://docs.python.org/2/library/htmlparser.html) (or something on top of that) to parse the HTML and then select the nodes you need and read their attributes. – CherryDT Jul 12 '16 at 11:36
  • @CherryDT Can you please provide a sample code or something. – dazzieta Jul 12 '16 at 11:39
  • I'm not really into python, so I don't feel confident providing an example. That's why this is a comment and not an answer. But what I said (there are regexes, but a HTML parser is preferred) applies to other languages as well. – CherryDT Jul 12 '16 at 11:40
  • Actually, it looks like you might already find the answer here: http://stackoverflow.com/questions/2612548/extracting-an-attribute-value-with-beautifulsoup – CherryDT Jul 12 '16 at 11:41
  • Thanks for the link, it works. I suggest you write it as an answer so that I can accept it. – dazzieta Jul 12 '16 at 11:45
  • Since I'm not sure which of the examples works best for your case and I'm not proficient enough to create a modified version, please add an answer yourself where you describe the final solution and accept it. This way others can still find the answer easily. It's fine that I don't get rep for this one. ^^ – CherryDT Jul 12 '16 at 12:03

2 Answers


You tagged this beautifulsoup, so I don't see why you would want a regex. If you want the href from anchors, you can use the CSS selector 'a[href]', which only matches anchor tags that have an href attribute:

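from bs4 import BeautifulSoup

h = '''<a href="Networking-denial-of-service.aspx">Next Page →</a>'''

# Name the parser explicitly so the result does not depend on which parsers are installed.
soup = BeautifulSoup(h, "html.parser")

print(soup.select_one('a[href]')["href"])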

Or find:

print(soup.find('a', href=True)["href"])

If you have multiple anchors, use select, which returns every match (select_one returns only the first):

for a in soup.select('a[href]'):
    print(a["href"])

Or:

for a in soup.find_all("a", href=True):
    print(a["href"])

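The quotes are not part of the attribute value, so there is no need to select on them, but you can restrict the match to hrefs that begin with a given prefix using ^=:

soup.select_one('a[href^="Networking"]')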
Padraic Cunningham

I am taking the question exactly as written - how to get data between two double quotes. I agree with the comments that an HTMLParser might be better...
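For completeness, here is a minimal sketch of the parser-based approach the comments mention, using Python 3's standard-library html.parser. The class name HrefCollector is made up for this illustration, and the sketch assumes you only care about href attributes on a tags.

```python
from html.parser import HTMLParser


class HrefCollector(HTMLParser):
    """Collect the href value of every <a> start tag fed to the parser."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; the surrounding quotes
        # have already been stripped from each value by the parser.
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value is not None:
                    self.hrefs.append(value)


atag = '<a href="Networking-denial-of-service.aspx">Next Page →</a>'
collector = HrefCollector()
collector.feed(atag)
collector.close()
print(collector.hrefs)  # ['Networking-denial-of-service.aspx']
```

Because the parser hands over attribute values directly, this does not depend on where the quotes sit in the string or on which quote character the page uses.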

A regular expression can help, particularly if you want to find more than one quoted string. For example, this is one possible piece of code:

import re
string_with_quotes = 'Some "text" "with inverted commas"\n "some text \n with a line break"'

Find_double_quotes = re.compile('"([^"]*)"', re.DOTALL|re.MULTILINE|re.IGNORECASE) # None of these flags is needed for this pattern ([^"] already matches newlines), but they can be useful elsewhere.

list_of_quotes = Find_double_quotes.findall(string_with_quotes)

list_of_quotes

['text', 'with inverted commas', 'some text \n with a line break']

If you have an odd number of double quotes, then the last double quote is ignored. If none are found, then an empty list is produced.
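As a quick check of those edge cases, the sketch below applies the same pattern (without the optional flags) to the anchor tag from the question, to a string with an unmatched quote, and to a string with no quotes. The variable names are only illustrative.

```python
import re

# Same pattern as above: capture everything between a pair of double quotes.
find_double_quotes = re.compile(r'"([^"]*)"')

atag = '<a href="Networking-denial-of-service.aspx">Next Page →</a>'
href_values = find_double_quotes.findall(atag)

# Three double quotes: the last one has no partner, so it is ignored.
odd_quotes = find_double_quotes.findall('say "hello" and "goodbye')

# No double quotes at all: findall returns an empty list.
no_quotes = find_double_quotes.findall('no quotes here')

print(href_values)  # ['Networking-denial-of-service.aspx']
print(odd_quotes)   # ['hello']
print(no_quotes)    # []
```

Note that this treats the input as plain text; if the tag had several quoted attributes, all of them would be returned, not just the href.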

Various references

http://www.regular-expressions.info/ is really good for learning regular expressions

The Stack Overflow question "Regex - Does not contain certain Characters" showed me how to exclude a character from a match.

https://docs.python.org/2/library/re.html#re.MULTILINE tells you what re.MULTILINE and re.DOTALL (underneath) do.

A. N. Other