Please help. I am parsing HTML using MSHTML. My code for getting all attributes of a particular tag looks like this:

void GetAttributes(MSHTML::IHTMLElementPtr pColumnInnerElement)
{
    IHTMLDOMNode *pElemDN = NULL;
    LONG lACLength;
    MSHTML::IHTMLAttributeCollection *pAttrColl;
    IDispatch* pACDisp;
    VARIANT vACIndex;
    IDispatch* pItemDisp;
    IHTMLDOMAttribute* pItem;
    BSTR bstrName;
    VARIANT vValue;
    VARIANT_BOOL vbSpecified;
    pColumnInnerElement->QueryInterface(IID_IHTMLDOMNode, (void**)&pElemDN);
    if (pElemDN != NULL)
    {
        pElemDN->get_attributes(&pACDisp);
        pACDisp->QueryInterface(IID_IHTMLAttributeCollection, (void**)&pAttrColl);
        pAttrColl->get_length(&lACLength);
        vACIndex.vt = VT_I4;
        for (int i = 0; i < lACLength; i++)
        {

            vACIndex.lVal = i;
            pItemDisp = pAttrColl->item(&vACIndex);
            if (pItemDisp != NULL)
            {
               pItemDisp->QueryInterface(IID_IHTMLDOMAttribute, (void**)&pItem);
               pItem->get_specified(&vbSpecified);
               pItem->get_nodeName(&bstrName);
               pItem->get_nodeValue(&vValue);

               if (vbSpecified)
                   cout<<_com_util::ConvertBSTRToString(bstrName)<<" :"<<_com_util::ConvertBSTRToString(vValue.bstrVal)<<endl;
               pItem->Release();
               pItemDisp->Release(); // released inside the NULL check so a NULL pItemDisp is never dereferenced
            }

        }
        pElemDN->Release();
        pACDisp->Release();
        pAttrColl->Release();
    }
}

The problem is that for the given tag <input id="Switch l_id2" class="pointer" name="Switch" onclick='SetControl("Switch l",1)' type="button" value="OK">, it prints all attributes except the value attribute. The get_specified function returns false for the value attribute.

My output is

id :Switch l_id2
class :pointer
onclick :SetControl("Switch l",1)
type :button
name :Switch

Any idea why? Also, which other attributes may have this problem?

Note

I tried the following. It shows the correct result for the value attribute.

        if (strcmp(_com_util::ConvertBSTRToString(bstrName), "value") == 0)
        {
            cout<<_com_util::ConvertBSTRToString(bstrName)<<" :"<<_com_util::ConvertBSTRToString(vValue.bstrVal)<<endl;
        }
999k
  • What does your Note mean? Is it due to the vbSpecified test? – Simon Mourier Jun 03 '13 at 06:42
  • I added the Note to show that the correct value is in vValue.bstrVal. But vbSpecified still returns false. – 999k Jun 03 '13 at 07:05
  • Not sure the specified flag is always meaningful. Have you tried changing the document compatibility mode (http://msdn.microsoft.com/en-us/library/cc288325.aspx)? For example, specified is always TRUE when IE is in IE9 'Standards mode'. – Simon Mourier Jun 03 '13 at 07:12
  • @SimonMourier I want to parse every tag and every attribute in an HTML document. Is there any other way to do this in C++? I have already started parsing HTML with MSHTML. Any advice would be helpful. – 999k Jun 03 '13 at 07:13
  • My web page is in IE8 compatibility mode, and I didn't find any documentation mentioning this behavior of get_specified. For an input of type text, get_specified returns false for the type attribute, but it works for an input of type button. – 999k Jun 03 '13 at 07:20
  • You should really consider spacing out and commenting your code, so that people who don't know that specific library really well could also try to assist. – idoby Jun 06 '13 at 19:45
  • I took this code from one of the examples on the MSDN site, so I don't expect any bug in my code. Sorry, I was expecting an answer from someone who has used attributes; that's why I didn't add any comments. – 999k Jun 07 '13 at 03:56

4 Answers

3

Do you really care about the specified flag? You said you want to process all attributes. If that is the case, you don't need to check the specified flag at all; just process every attribute.
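
A minimal sketch of that idea, written so it runs without MSHTML. The `Attr` struct and `CollectAttributes` function are hypothetical stand-ins, not part of any library: `isString` models the check `vValue.vt == VT_BSTR && vValue.bstrVal != NULL` that the real code would need before reading `bstrVal`, which is what guards against the null pointers mentioned in the comments.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical stand-in for one IHTMLDOMAttribute entry.
struct Attr {
    std::string name;
    bool specified;    // what get_specified reported (deliberately ignored below)
    bool isString;     // models: vValue.vt == VT_BSTR && vValue.bstrVal != NULL
    std::string value; // only meaningful when isString is true
};

// Keep every attribute that carries a usable, non-empty string value,
// regardless of the specified flag.
std::vector<std::string> CollectAttributes(const std::vector<Attr>& attrs) {
    std::vector<std::string> lines;
    for (const Attr& a : attrs) {
        if (!a.isString || a.value.empty()) {
            continue; // no string payload: skip instead of reading a bad pointer
        }
        lines.push_back(a.name + " :" + a.value);
    }
    return lines;
}
```

One trade-off of this filter: an attribute that was written with an empty value (for example alt="") is dropped as well, so it may need adjusting if empty attributes matter.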

Another thing: if I were you, I would use CComPtr instead of all the raw COM pointers.
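
To illustrate why that helps, here is a portable sketch of the ownership pattern behind CComPtr (ATL's smart pointer from <atlbase.h>). `FakeUnknown`, `ScopedRef`, `AcquireRef`, and `UseObject` are all hypothetical names made up for this example so that it compiles without ATL; with ATL available, something like `CComPtr<IHTMLDOMNode>` would play the role of `ScopedRef`.

```cpp
#include <cassert>

// Hypothetical ref-counted object standing in for a COM interface.
struct FakeUnknown {
    int refs = 1;               // the creator holds one reference
    void AddRef()  { ++refs; }
    void Release() { --refs; }  // a real COM object would destroy itself at 0
};

// Minimal owning wrapper: calls Release() automatically when it goes out of scope.
template <typename T>
class ScopedRef {
public:
    ScopedRef() : p_(nullptr) {}
    ~ScopedRef() { if (p_) p_->Release(); }
    ScopedRef(const ScopedRef&) = delete;
    ScopedRef& operator=(const ScopedRef&) = delete;

    T** Receive() { return &p_; }  // for out-parameters, as with QueryInterface
    T* operator->() const { return p_; }
    explicit operator bool() const { return p_ != nullptr; }

private:
    T* p_;
};

// Hypothetical factory that hands out one extra reference via an out-parameter.
bool AcquireRef(FakeUnknown& source, FakeUnknown** out) {
    source.AddRef();
    *out = &source;
    return true;
}

// Every return path releases the reference; no manual Release() calls needed.
int UseObject(FakeUnknown& source) {
    ScopedRef<FakeUnknown> ref;
    if (!AcquireRef(source, ref.Receive()) || !ref) {
        return -1;
    }
    return ref->refs; // reference count while the wrapper still holds it
}
```

The point of the pattern is that early returns and skipped branches cannot leak or double-release a pointer, which is easy to get wrong with the manual Release() calls in the question.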

Renjie
  • I am not that familiar with Visual Studio or advanced C++ features like CComPtr. I don't know which attributes are present in my tag, so if I call get_nodeValue() without checking the specified flag, it returns a null pointer and sometimes even a bad pointer. – 999k Jun 07 '13 at 04:49
3

If you are working in managed (C++/CLI) VC++, you can consider the HTML Agility Pack, available via NuGet.

If sticking to MSHTML is not necessary, you could parse the HTML documents as XML documents (this works only when the markup is well-formed XML, for example XHTML). That way you would be able to parse all the tags and attributes with a lot of flexibility. There are plenty of XML parsers available for C++.

This library looks compact, simple, and efficient (available for multiple platforms): https://github.com/leethomason/tinyxml2

Another one is: http://pugixml.org/

This link may help if you want to get rid of the MSHTML dependency: http://www.codeproject.com/Articles/30342/Remove-Microsoft-mshtml-dependency

wp78de
cpz
  • Thanks for your time and answer. Yes, I know there are many other parsers. After waiting 2 or 3 days with no reply here, I selected another HTML parser mentioned in another SO thread. – 999k Jun 07 '13 at 11:57
2

I've never worked with this before, but according to the library docs and DOM specs, it seems that get_nodeValue() does different things depending on the type of "node object". Try calling get_nodeValue() or get_nodeName() on the IHTMLDOMNode object. It seems clear that some properties like "value", "ID" and "Name" are not part of the attribute collection under the DOM.


MSHTML docs:

DOM spec:

idoby
  • Thank you for your time. Actually, get_nodeName() returns the tag name (INPUT in my case), not the attribute name. And I have already checked almost all of the IHTMLDOMNode interfaces in my code. – 999k Jun 07 '13 at 03:59
  • Also, the problem is not in the interface function get_nodeValue(). From my note it is clear that this function returns the correct value, but get_specified returns false even though the attribute is specified in the tag. – 999k Jun 07 '13 at 04:01
  • Sorry, I must have misunderstood the question (never used this library before). Both documents listed in my answer state that the specified flag should be true for the value attribute. This is an old MS library, and it may have bugs though. I'd recommend switching to a more generic XML parsing engine, as cpz suggested in his answer. – idoby Jun 07 '13 at 06:47
2

Check whether the element is an input, then query for the IHTMLInputElement interface (IID_IHTMLInputElement) and call get_value.
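
A hedged sketch of those steps, assuming Visual C++ with ATL and the standard <mshtml.h> header (rather than the #import wrappers used in the question). `TryGetInputValue` is a hypothetical helper name; the interfaces and calls (`IHTMLInputElement`, `get_value`, `CComQIPtr`, `CComBSTR`) are the real MSHTML and ATL ones. The block is compiled out on other toolchains because these headers exist only on Windows.

```cpp
#if defined(_MSC_VER)  // MSHTML and ATL headers ship with Visual C++ on Windows
#include <atlbase.h>
#include <mshtml.h>
#include <string>

// Reads the value of an <input> element through IHTMLInputElement.
// Returns false when the element is not an <input> or the call fails.
bool TryGetInputValue(IHTMLElement* element, std::wstring& valueOut)
{
    if (element == NULL) {
        return false;
    }

    // The QueryInterface behind CComQIPtr succeeds only for elements
    // that implement IHTMLInputElement, so it doubles as the type check.
    CComQIPtr<IHTMLInputElement> input(element);
    if (!input) {
        return false;
    }

    CComBSTR value;  // frees the BSTR automatically
    if (FAILED(input->get_value(&value))) {
        return false;
    }

    // A NULL BSTR is a valid empty string.
    BSTR raw = value;
    valueOut = raw ? std::wstring(raw, ::SysStringLen(raw)) : std::wstring();
    return true;
}
#endif
```

In the attribute loop from the question, this could be called once per element after the loop, so that value is reported even when get_specified says false for it.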

mark