3

I need to extract the parent tag of a string that I match in HTML. I have many raw HTML sources, and each one contains a text value of the form "VIN:*" (the label followed by some characters). This value sits inside a different element in each source, such as <ul>, <div>, etc.

I then need to extract all the values that appear alongside that "VIN:*" string, which means I need to get its parent tag.

For example,

<div class="class1">

                            Stock Number:
                            Z2079
                            <br>
                            **VIN:
                            2T2HK31UX9C110701**
                            <br>
                            Model Code:
                            9424
                            <img class="imgcert" src="/images/Lexus_cpo.jpg">
</div>

Here the VIN sits inside a <div>. My other HTML sources contain a VIN in the same way, but inside different elements.

These values have to be extracted in Python.

Is there an efficient way in Python to extract the parent tag by matching the string?

Nicolas Kaiser
Nava
  • 2
    NEVER use regexps to parse HTML. Especially in python you can do it much better using BeautifulSoup... – ThiefMaster Dec 30 '11 at 13:03
  • 1
    @saravana ThiefMaster's ukase and assertion are controversial, unargued opinions that I contest. I say this to counterbalance the general religious belief he expresses (why religious: see this answer: http://stackoverflow.com/a/1732454/551449) – eyquem Dec 30 '11 at 15:08
  • @saravan Regular expressions are far faster than _BeautifulSoup_ and _lxml_ – eyquem Dec 30 '11 at 15:10

3 Answers

3

I would strongly recommend going with BeautifulSoup on this; it provides some incredibly convenient functionality for parsing HTML. Here, for example, is how I would go about finding every text node that contains "VIN" in either case:

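from bs4 import BeautifulSoup  # BeautifulSoup 4
soup = BeautifulSoup(your_html_here, 'html.parser')
# keep every text node that contains "vin", whatever its case
vins = soup.find_all(string=lambda s: s is not None and 'vin' in s.lower())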

From there, you simply walk through that collection, grab each node's parent, grab said parent's contents, and parse them as you see fit:

for v in vins:
    parent_html = v.parent.contents
    # more code here
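
If installing BeautifulSoup is not an option, the same "find the text, then look at its parent" idea can be sketched with the standard library's html.parser. This is only a rough illustration and not the BeautifulSoup API: the class name VinParentFinder, the VOID_TAGS set and the sample markup are assumptions made up for this example.

```python
from html.parser import HTMLParser

# Elements that never get a closing tag (a partial list, assumed sufficient here).
VOID_TAGS = {"br", "img", "hr", "input", "meta", "link"}


class VinParentFinder(HTMLParser):
    """Record the enclosing element of every text node that mentions 'VIN:'."""

    def __init__(self):
        super().__init__()
        self.stack = []   # currently open elements as (tag, attrs) pairs
        self.found = []   # (parent tag, parent attrs, VIN value) triples

    def handle_starttag(self, tag, attrs):
        if tag not in VOID_TAGS:
            self.stack.append((tag, dict(attrs)))

    def handle_endtag(self, tag):
        # Pop back to the matching open tag; ignore stray closers such as </br>.
        for i in range(len(self.stack) - 1, -1, -1):
            if self.stack[i][0] == tag:
                del self.stack[i:]
                break

    def handle_data(self, data):
        if "VIN:" in data and self.stack:
            words = data.split("VIN:", 1)[1].split()
            if words:
                tag, attrs = self.stack[-1]
                self.found.append((tag, attrs, words[0].strip("*")))


html = '''<div class="class1">
    Stock Number:
    Z2079
    <br>
    **VIN:
    2T2HK31UX9C110701**
    <br>
    Model Code:
    9424
    <img class="imgcert" src="/images/Lexus_cpo.jpg">
</div>'''

finder = VinParentFinder()
finder.feed(html)
finder.close()
print(finder.found)  # [('div', {'class': 'class1'}, '2T2HK31UX9C110701')]
```

The stack plays the role of v.parent: whatever element is on top when the text arrives is the parent of that text node.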
ranksrejoined
  • The findAll call fails with: AttributeError: 'builtin_function_or_method' object has no attribute 'index' – Nava Dec 30 '11 at 10:10
  • Sorry about that, lower() is of course what you're looking for. Ruby has spoiled me. – ranksrejoined Dec 30 '11 at 10:12
  • 2
    `soup = BeautifulSoup(htmltext); vins = soup(text=lambda x: 'vin' in x.lower())` – jfs Dec 30 '11 at 13:03
1

For such a simple task, which consists of ANALYZING the string, not PARSING it (parsing = building a tree representation of the text), you can do:

the text

ss = '''
Humpty Dumpty sat on a wall
<div class="class1">
    Stock Number:
    Z2079
    <br>
        **VIN:
        2T2HK31UX9C110701**
    <br>
    Model Code:
    9424
    <img class="imgcert" src="/images/Lexus_cpo.jpg">
</div>

Humpty Dumpty had a great fall
<ul cat="zoo">
    Stock Number:
    ARDEN3125
    <br>
        **VIN:
        SHAKAMOSK-230478-UBUN**
    </br>
    Model Code:
    101
    <img class="imgcert" src="/images/Magana_cpo.jpg">
</ul>

All the king's horses and all the king's men
<artifice>
    <baradino>
        Stock Number:
        DERT5178
        <br>
            **VIN:
            Pandaia-67-Moro**
        <br>
        Model Code:
        1234
        <img class="imgcert" src="/images/Pertuis_cpo.jpg">
    </baradino>
    what what what who what
    <somerset who="maugham">
        Nothing to declare
    </somerset>
</artifice>

Couldn't put Humpty Dumpty again
<ending rtf="simi">
    Stock Number:
    ZZZ789
    <br>
        **VIN:
        0000012554-ENDENDEND**
    <br>
    Model Code:
    QS78-9
    <img class="imgcert" src="/images/Sunny_cpo.jpg">
</ending>

qsdjgqsjkdhfqjkdhgfjkqshgdfkjqsdjfkh''' 

the code:

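import re

# Non-raw strings: '\\1' is the backreference \1 and '\\*' a literal asterisk.
regx = re.compile('<([^ >]+) ?([^>]*)>'
                  '(?!.+?<(?!br>)[^ >]+>.+?<br>.+?</\\1>)'
                  '.*?\\*\\*VIN:(.+?)\\*\\*.+?</\\1>',
                  re.DOTALL)

# One (tag name, tag attributes, VIN value) triple per match.
li = [ (mat.group(1),mat.group(2),mat.group(3).strip(' \n\r\t'))
       for mat in regx.finditer(ss) ]

for el in li:
    print('(%-15r, %-25r, %-25r)' % el)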

the result

('div'          , 'class="class1"'         , '2T2HK31UX9C110701'      )
('ul'           , 'cat="zoo"'              , 'SHAKAMOSK-230478-UBUN'  )
('baradino'     , ''                       , 'Pandaia-67-Moro'        )
('ending'       , 'rtf="simi"'             , '0000012554-ENDENDEND'   )

re.DOTALL is necessary so that the dot also matches newlines (by default, a dot in a regular expression pattern matches any character except a newline).

\\1 is a backreference: it specifies that, at this place in the examined string, there must be the same text that was captured by the first group, that is to say, by the part ([^ >]+)
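
The two points above (re.DOTALL and the backreference) can be checked on a tiny made-up string. This is a minimal sketch, not the pattern from the answer: the sample text and the simplified pattern are assumptions, and the raw string r"\1" is the same backreference as '\\1' in a normal string.

```python
import re

text = "<div>\nVIN: ABC\n</div> <ul>x</div>"

# Without re.DOTALL the dot stops at each newline, so nothing matches.
no_dotall = re.search(r"<(\w+)>.*VIN.*</\1>", text)

# With re.DOTALL the dot crosses newlines; \1 must repeat the tag name
# captured by group 1, so <div> can only be closed by </div>.
with_dotall = re.search(r"<(\w+)>.*?VIN.*?</\1>", text, re.DOTALL)

print(no_dotall)             # None
print(with_dotall.group(1))  # div
```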

'(?!.+?<(?!br>)[^ >]+>.+?<br>.+?</\\1>)' is a negative lookahead: it forbids finding any tag other than <br> before the first <br> tag encountered between the opening tag and the closing tag of an HTML element.
This part is necessary to catch the closest tag preceding VIN, apart from <br>.
If this part isn't present, the regex

regx = re.compile('<([^ >]+) ?([^>]*)>'
                  '.*?\\*\\*VIN:(.+?)\\*\\*.+?</\\1>',
                  re.DOTALL)

catches the following result:

('div'          , 'class="class1"'         , '2T2HK31UX9C110701'      )
('ul'           , 'cat="zoo"'              , 'SHAKAMOSK-230478-UBUN'  )
('artifice'     , ''                       , 'Pandaia-67-Moro'        )
('ending'       , 'rtf="simi"'             , '0000012554-ENDENDEND'   )

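The difference is 'artifice' instead of 'baradino': without the lookahead, the regex matches the outer element rather than the element closest to the VIN.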
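
The 'artifice' versus 'baradino' difference can be reproduced on a one-line string. The sketch below uses a deliberately simplified lookahead, (?![^<]*<\w), which is an assumption made for illustration and not the pattern from the answer: it rejects a candidate whose next tag is another opening tag, and unlike the real pattern it makes no exception for <br>.

```python
import re

s = "<artifice><baradino>VIN: Pandaia-67-Moro</baradino></artifice>"

# No lookahead: the leftmost opening tag wins, i.e. the outer element.
outer = re.search(r"<(\w+)>.*?VIN.*?</\1>", s)

# (?![^<]*<\w) rejects a candidate whose next tag is an opening tag,
# so only the element that directly contains the text can match.
inner = re.search(r"<(\w+)>(?![^<]*<\w).*?VIN.*?</\1>", s)

print(outer.group(1))  # artifice
print(inner.group(1))  # baradino
```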

eyquem
0

For a pure string version that does not use any XML/HTML parser, you might try regular expressions (re):

import re

html_doc = """ <div ...VIN ...  /div>"""

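# raw string so that \1 is a backreference; returns the tag names captured by group 1
results = re.findall(r'<(\w+)[^>]*>.*?VIN.*?</\1>', html_doc, re.DOTALL)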
Don Question
  • I would like to point out a very relevant answer: http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags – Krystian Cybulski Dec 30 '11 at 12:26