-3

I want to parse the example below out of an HTML page.

The example is part of a specific HTML document.

<p>NUCLEAR EK:</p>

<ul>
<li>2015-01-29 17:22:12 UTC - culturemerge.ga - GET /AgJVAhoAGFpMUAVU.html</li>
<li>2015-01-29 17:22:13 UTC - culturemerge.ga - GET /AU4STwAHU1NMUUlcSlMHVAFRVwJTB1RKVx1XA1ZMAVUFSgRWTwBfVg</li>
<li>2015-01-29 17:22:15 UTC - culturemerge.ga - GET /Al8OVhpVUFUBHgYYDh4CUgFWVwVQBFYGHgZIAlRQHlMCVBhQBxoGGDpaIEUi</li>
<li>2015-01-29 17:22:17 UTC - culturemerge.ga - GET /Al8OVhpVUFUBHgYYDh4CUgFWVwVQBFYGHgZIAlRQHlMCVBhQBxoGGBpgEF8mYRhdIk9W</li>
<li>2015-01-29 17:22:21 UTC - culturemerge.ga - GET /Al8OVhpVUFUBHgYYDh4CUgFWVwVQBFYGHgZIAlRQHlMCVBhQBxoEGDpaIEUi</li>
<li>2015-01-29 17:22:22 UTC - culturemerge.ga - GET /Al8OVhpVUFUBHgYYDh4CUgFWVwVQBFYGHgZIAlRQHlMCVBhQBxoEGBpgEF8mYRhdIk9W</li>
<li>2015-01-29 17:22:23 UTC - culturemerge.ga - GET /AU4STwAHU1NMUUlcSlMHVAFRVwJTB1RKVx1XA1ZMAVUFSgRWTxVaCBRVEA</li>
<li>2015-01-29 17:22:25 UTC - culturemerge.ga - GET /Al8OVhpVUFUBHgYYDh4CUgFWVwVQBFYGHgZIAlRQHlMCVBhQBxoLGDpaIEUi</li>
<li>2015-01-29 17:22:28 UTC - culturemerge.ga - GET /Al8OVhpVUFUBHgYYDh4CUgFWVwVQBFYGHgZIAlRQHlMCVBhQBxoLGBpgEF8mYRhdIk9W</li>
</ul>

I want to get the content from <p> through </ul>.

So I wrote the following regular expression code in Python:

temp=re.findall(r"<p>[^\"\&\;]*?<\/p>\s*<ul>\s*<li>\d(.|\s)*?<\/ul>",html)
print temp

This pattern works well in Notepad++ and Regex Coach,

but in Python it does not match anything.

It only shows an empty list: []

Somputer
  • possible duplicate of [RegEx match open tags except XHTML self-contained tags](http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags) – msw May 20 '15 at 01:50
  • No. My goal is to find specific content like the example, and to find the name between <p> and </p>, so I want to find all content like the example. My pattern works well in Regex Coach, but it does not work with Python's re. – Somputer May 20 '15 at 01:57
  • The "This question may already have an answer here" notice should be removed. – Somputer May 20 '15 at 02:06

2 Answers

0

While I agree you shouldn't use regexp to parse html, sometimes it's ok. In this case I see some sort of a pattern, but I'm not quite sure about what you want to extract from the html. I'll just rewrite your regexp hoping it's what you're looking for:

# Capture (date and time, domain, path) from each <li> line.
temp=re.findall(r"<li>(\d{4}-\d{2}-\d{2} [\d:]{8}).* - (.*) - GET (.*)<\/li>",html)
for i in temp:
    print(i)

temp will contain one tuple per <li> line with this data: (date and time, domain, path)
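To make the shape of the result concrete, here is a minimal, self-contained sketch of the same idea. It is written for Python 3 (so `print` is a function), which is an assumption since the question uses Python 2 syntax. The `html` sample is just two lines copied from the question, and the names `LI_PATTERN` and `rows` are made up for illustration.

```python
import re

# Two sample lines copied from the question's HTML (not the full page).
html = """<ul>
<li>2015-01-29 17:22:12 UTC - culturemerge.ga - GET /AgJVAhoAGFpMUAVU.html</li>
<li>2015-01-29 17:22:13 UTC - culturemerge.ga - GET /AU4STwAHU1NMUUlcSlMHVAFRVwJTB1RKVx1XA1ZMAVUFSgRWTwBfVg</li>
</ul>"""

# Three capture groups: date and time, domain, request path.
# "." does not match a newline by default, so each match stays on one line.
LI_PATTERN = re.compile(
    r"<li>(\d{4}-\d{2}-\d{2} [\d:]{8}).* - (.*) - GET (.*)</li>"
)

# With several groups, findall returns a list of tuples, one per match.
rows = LI_PATTERN.findall(html)
for timestamp, domain, path in rows:
    print(timestamp, domain, path)
```

This only handles lines shaped exactly like the ones in the example. If the real page has other `<li>` layouts, the pattern would need adjusting.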

Cornel Ghiban
0
    temp=re.finditer(r"<p>[^\"\&\;]*?<\/p>\s*<ul>\s*<li>\d(.|\s)*?<\/ul>",html)
    for match in temp:
        print match.group(0)
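One detail that may be useful when comparing `findall` and `finditer`: when a pattern contains a capture group, `re.findall` returns the text of the group rather than the whole match, while `match.group(0)` is always the whole match. The sketch below shows this on a shortened copy of the example. It is written for Python 3 (an assumption, the code above is Python 2), the names `grouped`, `whole` and `full` are made up for illustration, and it does not by itself explain an empty list, which means the pattern found no match in the real page at all.

```python
import re

# Shortened copy of the example from the question.
html = """<p>NUCLEAR EK:</p>

<ul>
<li>2015-01-29 17:22:12 UTC - culturemerge.ga - GET /AgJVAhoAGFpMUAVU.html</li>
</ul>"""

# The pattern from the question; (.|\s) is a capture group.
grouped = r"<p>[^\"\&\;]*?<\/p>\s*<ul>\s*<li>\d(.|\s)*?<\/ul>"

# findall returns what the group captured last, not the whole block.
print(re.findall(grouped, html))

# finditer gives match objects; group(0) is the whole matched block.
whole = [m.group(0) for m in re.finditer(grouped, html)]
print(whole[0])

# Alternative: drop the group and let "." match newlines with re.DOTALL,
# so findall returns the whole block directly.
full = re.findall(r"<p>[^\"&;]*?</p>\s*<ul>\s*<li>\d.*?</ul>", html, re.DOTALL)
print(full[0] == whole[0])
```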
Somputer