-2

I have the source code of a page stored in a variable:

<!DOCTYPE html><html><head><title>Intro</title></head><body><a href='/name=t1.304.log'>Test</a>.  </body></html>

I want to extract t1.304.log from the line above. I am using print log_name.split(".log",1)[0], but it returns everything before the first ".log" instead.

alecxe
Aquarius24
  • Can you elaborate what you mean by extracting the desired string out of the line? Do you want to extract any string that looks like "something.log"? – Leo Sep 27 '15 at 20:42
  • Yes, any string that ends with .log, and it will occur only once – Aquarius24 Sep 27 '15 at 20:43
  • By "only once", do you mean only the first matching substring? Or do you want to make sure the string only contains one match? – Leo Sep 27 '15 at 20:45

4 Answers

3

Why not parse the HTML with an HTML parser?

>>> from bs4 import BeautifulSoup
>>> data = "<!DOCTYPE html><html><head><title>Intro</title></head><body><a href='/name=t1.304.log'>Test</a>.  </body></html>"
>>> BeautifulSoup(data, "html.parser").a["href"].split("=")[-1]
't1.304.log'
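
If installing bs4 is not an option, the same idea can probably be sketched with the standard library's html.parser. This is only an illustration: the class name HrefCollector and the helper log_name_from_html are made up for this example, and it assumes the log name is whatever follows the last "=" in the first href that ends in ".log".

```python
from html.parser import HTMLParser


class HrefCollector(HTMLParser):
    """Collect the href value of every <a> tag (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            # attrs is a list of (name, value) pairs
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def log_name_from_html(html):
    """Return the text after the last '=' of the first href ending in .log, or None."""
    parser = HrefCollector()
    parser.feed(html)
    for href in parser.hrefs:
        if href.endswith(".log"):
            return href.rsplit("=", 1)[-1]
    return None


data = "<!DOCTYPE html><html><head><title>Intro</title></head><body><a href='/name=t1.304.log'>Test</a>.  </body></html>"
print(log_name_from_html(data))  # t1.304.log
```

Returning None when no matching link exists lets the caller decide how to handle a page without a log link.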
alecxe
1

If you just want a quick solution, you can use the split() function:

log_name.split("'")[1].split("=")[1]

However, to do it in a reusable way, look into a tool like BeautifulSoup.

Edited to add

Based on your comments you could do this:

print(log_name.split(".log",1)[0].rsplit("=",1)[1] + ".log")
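
To see why this chain works where split(".log", 1)[0] alone does not, it can help to look at each intermediate value. A small sketch, assuming log_name holds the page source from the question (the names before_log, stem and result are only for illustration):

```python
log_name = "<!DOCTYPE html><html><head><title>Intro</title></head><body><a href='/name=t1.304.log'>Test</a>.  </body></html>"

# Everything before the first ".log" (this is what the question's code printed)
before_log = log_name.split(".log", 1)[0]

# Keep only the part after the last "=" in that prefix
stem = before_log.rsplit("=", 1)[1]

# split() dropped the ".log" separator, so add it back
result = stem + ".log"

print(before_log)
print(stem)    # t1.304
print(result)  # t1.304.log
```

Note that this sketch does not check that ".log" actually occurs in the input, so it is only safe when the page is known to contain exactly one such link.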
dstudeba
  • That is not a string; I am taking the value from the page's source code – Aquarius24 Sep 27 '15 at 20:59
  • import urllib url = 'http://www.google.com" logfile = urllib.urlopen(url) logfile = logfile.read() logfile= logfile.split(".log",1)[0].rsplit("=",1)[1] + ".log") – Aquarius24 Sep 27 '15 at 21:00
0
import re

st = " <!DOCTYPE html><html><head><title>Intro</title></head><body><a href='/name=t1.304.log'>Test</a>.  </body></html>"

# Match "t", then a run of non-whitespace characters, ending in "log"
mo = re.search(r'(t\S*log)', st)

print(mo.group())

Output:

t1.304.log
LetzerWille
0

You could use a regular expression (with the re module), assuming your string variable is page_source:

>>> import re
>>> re.findall(r'.*=(.*\.log)', page_source)
['t1.304.log']

This gives you a list of the matching "*.log" substrings (at most one per line, because the leading .* is greedy).
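
If more than one log name can appear on a line, a narrower pattern may be easier to reason about. The following is a sketch, not a general HTML solution: the sample page_source with two links is invented for illustration, and the pattern assumes each name follows an "=" and contains no quotes, "=" signs, or whitespace.

```python
import re

# Hypothetical input with two links on one line
page_source = "<a href='/name=t1.304.log'>A</a> <a href='/name=t2.305.log'>B</a>"

# After "=", capture characters that are not quotes, "=", or whitespace,
# ending in a literal ".log"
pattern = re.compile(r"=([^'\"=\s]+\.log)")

print(pattern.findall(page_source))  # ['t1.304.log', 't2.305.log']
```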

But be warned: it is generally not advisable to use regular expressions to parse HTML (see this discussion).

In fact, don't do this; use alecxe's answer.

Community
Leo