0

I have a school project that I am working on and the outcome is pointless it seems, but it's got more to do with the experience gained through this I believe. What I am trying to do is submit an initial URL, then pull all the URLs on that page and visit them in order and do this until I tell it to stop. All of the URLs will be recorded in a text file. So far, I am able to open a window in IE and launch a webpage of my choosing. So now I need to know how to send IE to a new webpage using the same session and also how I can scan and pull data from the websites I visit. Thanks for any help!

Here is my code so far:

#include <string>
#include <iostream>
#include <windows.h>
#include <stdio.h>
#include <tchar.h>

using namespace std;

int main( int argc, TCHAR *argv[] )
{
    std::string uRL, prog;
    int length, count;

    STARTUPINFO si;
    PROCESS_INFORMATION pi;

    ZeroMemory( &si, sizeof(si) );
    si.cb = sizeof(si);
    ZeroMemory( &pi, sizeof(pi) );

    //if( argc != 2 )
    //{
    //    printf("Usage: %s [cmdline]\n", argv[0]);
    //    system("PAUSE");
    //    return 0;
    //}

    std::cout << "Enter URL: ";
    std::cin >> uRL;

    prog = ("C:\\Program Files\\Internet Explorer\\iexplore.exe ") + uRL;

    char *cstr = new char[prog.length() + 1];
    strcpy(cstr, prog.c_str());

    // Start the child process. 
    if( !CreateProcess(NULL,   // No module name (use command line)
        _T(cstr),        // Command line
        NULL,           // Process handle not inheritable
        NULL,           // Thread handle not inheritable
        FALSE,          // Set handle inheritance to FALSE
        0,              // No creation flags
        NULL,           // Use parent's environment block
        NULL,           // Use parent's starting directory 
        &si,            // Pointer to STARTUPINFO structure
        &pi )           // Pointer to PROCESS_INFORMATION structure
    ) 
    {
        printf( "CreateProcess failed (%d).\n", GetLastError() );
        system("PAUSE");
        return 0;
    }

    cout << HRESULT get_Count(long *Count) << endl;

    //cout << count << endl;

    system("PAUSE");

    // Wait until child process exits.
    WaitForSingleObject( pi.hProcess, INFINITE );

    // Close process and thread handles. 
    CloseHandle( pi.hProcess );
    CloseHandle( pi.hThread );

    delete [] cstr;

    return 0;
}
DonkeyKong
  • 1,005
  • 14
  • 18
  • `_T` on a variable will fail if you have `UNICODE` enabled. Best to just use wide strings anyway. And this task is a lot more complicated than just opening up an instance of IE. – chris Jun 02 '13 at 01:52
  • alright I'll get rid of the _T. I just used it because it was used in the examples i looked at. Telling me that it is difficult doesn't answer the question. Telling me how to send another URL answers part of the question though :) – DonkeyKong Jun 02 '13 at 01:59
  • Honestly, I don't know how to parse out links in anything other than JS. There's probably a library that would help parse them at least. I know there's a COM interface for interacting with IE, though. – chris Jun 02 '13 at 02:02
  • I've seen COM mentioned a lot while searching for this. Everyone also has said that it's not easy to use. I will look into it though. – DonkeyKong Jun 02 '13 at 02:04
  • Yeah, it's really not my cup of tea. I can get by, but only after a lot of searching and looking at documentation and everything. I have a fair bit of the winapi memorized from repeated use, but that eludes me. – chris Jun 02 '13 at 02:09
  • If you know the WinAPI, doesn't that use a lot of handles? I have a handle on IE now I just need to manipulate it... at least that sounds right in my head. – DonkeyKong Jun 02 '13 at 02:15
  • Web content is a looot different than controls on desktop applications. – chris Jun 02 '13 at 02:30
  • Chris, you'll be proud of me. I figured out part of my answer. Still need to grab the URLs from the page. This will open a new tab, unfortunately that means that the first one remains open and now unused. ShellExecute(NULL, "open", "http://www.microsoft.com", NULL, NULL, SW_SHOWNORMAL); found that in the WinAPI ;) – DonkeyKong Jun 02 '13 at 02:33

2 Answers2

1

If you want to crawl a webpage launching Internet Explorer is not going to work very well. I also don't recommend attempting to parse the HTML page yourself unless you are prepared for a lot of heartache and hassle. Instead I recommend that you create an instance of an IWebBrowser2 object and use it to navigate to the webpage, grab the appropriate IHTMLDocument2 object and iterate through the elements picking out the URL's. It's far easier and is a common approach using components that are already installed on Windows. The example below should get your started and on your way to crawling the web like proper spider should.

#include <comutil.h>    // _variant_t
#include <mshtml.h>     // IHTMLDocument and IHTMLElement
#include <exdisp.h>     // IWebBrowser2
#include <atlbase.h>    // CComPtr
#include <string>
#include <iostream>
#include <vector>

// Make sure we link in the support library!
#pragma comment(lib, "comsuppw.lib")


// Load a webpage
HRESULT LoadWebpage(
    const CComBSTR& webpageURL,
    CComPtr<IWebBrowser2>& browser,
    CComPtr<IHTMLDocument2>& document)
{
    HRESULT hr;
    VARIANT empty;

    VariantInit(&empty);

    // Navigate to the specifed webpage
    hr = browser->Navigate(webpageURL, &empty, &empty, &empty, &empty);

    //  Wait for the load.
    if(SUCCEEDED(hr))
    {
        READYSTATE state;

        while(SUCCEEDED(hr = browser->get_ReadyState(&state)))
        {
            if(state == READYSTATE_COMPLETE) break;
        }
    }

    // The browser now has a document object. Grab it.
    if(SUCCEEDED(hr))
    {
        CComPtr<IDispatch> dispatch;

        hr = browser->get_Document(&dispatch);
        if(SUCCEEDED(hr) && dispatch != NULL)
        {
            hr = dispatch.QueryInterface<IHTMLDocument2>(&document);
        }
        else
        {
            hr = E_FAIL;
        }
    }

    return hr;
}


void CrawlWebsite(const CComBSTR& webpage, std::vector<std::wstring>& urlList)
{
    HRESULT hr;

    // Create a browser object
    CComPtr<IWebBrowser2> browser;
    hr = CoCreateInstance(
        CLSID_InternetExplorer,
        NULL,
        CLSCTX_SERVER,
        IID_IWebBrowser2,
        reinterpret_cast<void**>(&browser));

    // Grab a web page
    CComPtr<IHTMLDocument2> document;
    if(SUCCEEDED(hr))
    {
        // Make sure these two items are scoped so CoUninitialize doesn't gump
        // us up.
        hr = LoadWebpage(webpage, browser, document);
    }

    // Grab all the anchors!
    if(SUCCEEDED(hr))
    {
        CComPtr<IHTMLElementCollection> urls;
        long count = 0;

        hr = document->get_all(&urls);

        if(SUCCEEDED(hr))
        {
            hr = urls->get_length(&count);
        }

        if(SUCCEEDED(hr))
        {
            for(long i = 0; i < count; i++)
            {
                CComPtr<IDispatch>  element;
                CComPtr<IHTMLAnchorElement> anchor;

                // Get an IDispatch interface for the next option.
                _variant_t index = i;
                hr = urls->item( index, index, &element);
                if(SUCCEEDED(hr))
                {
                    hr = element->QueryInterface(
                        IID_IHTMLAnchorElement, 
                        reinterpret_cast<void **>(&anchor));
                }

                if(SUCCEEDED(hr) && anchor != NULL)
                {
                    CComBSTR    url;
                    hr = anchor->get_href(&url);
                    if(SUCCEEDED(hr) && url != NULL)
                    {
                        urlList.push_back(std::wstring(url));
                    }
                }
            }
        }
    }
}

int main()
{
    HRESULT hr;

    hr = CoInitialize(NULL);
    std::vector<std::wstring>   urls;

    CComBSTR webpage(L"http://cppreference.com");


    CrawlWebsite(webpage, urls);
    for(std::vector<std::wstring>::iterator it = urls.begin();
        it != urls.end();
        ++it)
    {
        std::wcout << "URL: " << *it << std::endl;

    }

    CoUninitialize();

    return 0;
}
Captain Obvlious
  • 19,754
  • 5
  • 44
  • 74
  • This sounds like exactly what I'm after, albeit it will involve scrapping the majority of what I'd already done but what I'm doing may not end up going anywhere anyway. I'm having some difficulty compiling this example though. My compiler doesn't seem to like the first 4 header files. – DonkeyKong Jun 02 '13 at 03:41
  • It's not necessarily the _right_ way it's just an easy approach that doesn't require a lot of extra code and you don't have to parse the HTML page yourself. Since you didn't specify the compiler you are using all I can suggest is that you may need to install the Windows SDK. – Captain Obvlious Jun 02 '13 at 03:45
  • I'm not sure how well MinGW (the compiler DevC++ uses) handles the Windows SDK provided by MS. It's hard to make a recommendation for this. You have the option of installing the SDK and giving it a shot, using [Visual Studio Express](http://www.microsoft.com/visualstudio/eng/products/visual-studio-express-for-windows-8#product-express-windows) which is **free**, using third party libraries for downloading and parsing the HTML pages or writing an HTML parser yourself. – Captain Obvlious Jun 02 '13 at 04:09
  • I downloaded the Windows 7 SDK. I have visual studio express 2010. Using visual studio I get this: 1>------ Build started: Project: URL, Configuration: Debug Win32 ------ 1>LINK : error LNK2001: unresolved external symbol _mainCRTStartup 1>c:\users\owner\documents\visual studio 2010\Projects\URL\Debug\URL.exe : fatal error LNK1120: 1 unresolved externals ========== Build: 0 succeeded, 1 failed, 0 up-to-date, 0 skipped ========== – DonkeyKong Jun 02 '13 at 04:41
  • Sounds like you created it as a _Win32 Project_ instead of a _Win32 Console Application_. Try recreating the project as a console app and if the problem persists consider posting a new question as it's a totally different problem (That's Stackoverflow etiquette) – Captain Obvlious Jun 02 '13 at 04:57
  • Did this work for you? I couldn't get it. It's possible the SDK didn't install properly cause it asked me if it installed right. – DonkeyKong Jun 02 '13 at 05:16
  • Yes, I tested it before I posted. – Captain Obvlious Jun 02 '13 at 05:19
  • WORKS! I would vote up but I can't. Crazy program here. I could probably ask a million questions about it. Thanks. Oh yeah, problem was that I made a new project and didn't add the file to the project (just did file->new). – DonkeyKong Jun 02 '13 at 15:11
0

To scan and pull data from the websites, you'll want to capture the HTML and iterate through it looking for all character sequences matching a certain pattern. Have you ever used regular expressions? Regular expressions would by far be the best here, but if you understand them (just look up a tutorial on the basics) then you can manually apply the pattern-recognition concepts to this project.

So what you're looking for is something like http(s)://.. It's more complex though, because domain names are a rather intricate pattern. You'll probably want to use a third-party HTML parser or regular expression library, but it's doable without it, although pretty tedious to program.

Here's a link about regular expressions in c++: http://www.johndcook.com/cpp_regex.html

Nathan
  • 73,987
  • 14
  • 40
  • 69
  • 1
    [I hope you're not serious about parsing HTML with regular expressions.](http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags) – chris Jun 02 '13 at 02:49
  • "_Regular expressions would by far be the best here_" - worst piece of advice I've seen today. – Captain Obvlious Jun 02 '13 at 03:17
  • Ah Chris, the top response in that link made my life. Well, thanks for the corrections, I wasn't aware of that. The basic idea of pattern matching still provides a good way to think of things, even if regular expressions themselves aren't applicable here. – Nathan Jun 02 '13 at 04:48
  • Nathan, this seems a viable solution also. Captain Obvlious' was much better, but also incredibly complex and used lots of specific references and what not. If you read farther down the post a lot of the guys talk about being able to sort through html and in my case I wouldn't be looking for anything but http://, www., or .com and those wouldn't get caught up in html I don't think. Either way, couldn't get the URLDownloadtoFile or whatever function to work. – DonkeyKong Jun 02 '13 at 15:14
  • I'm not 100% sure (I'm not that well versed in HTML), but aren't links always contained located after href= in HTML? From looking at the HTML source of this site (ctrl+u in firefox), it seems like links are after href= and then are in quotes. It seems like there are two kinds of links in the href, local or external. The local ones are something like href="/questions/1421/how-to-teach-your-computer-to-fly" that go directly after the stackoverflow.com local domain name, and global ones like href="www.cnn.com". So If something is after href= and contained in quotation marks, it's a link. – Nathan Jun 02 '13 at 21:41
  • You'd probably want to just follow the global ones, in other words all the ones of the pattern: href="http[anything-thats-not-a-quotation-mark]" You could do this by loading the HTML into a string (or treating it as a filestream or something) and iterating through it searching for this pattern. Look up the basics of a lexical analyser or scanner, it will give you some ideas. The only problem here is you can get a few duplicates. For instance, if I type href="www.cnn.com" here, then in the underlying HTML of this site there are two matches to that pattern (think about it) – Nathan Jun 02 '13 at 22:16
  • @CaptainObvlious Regular expressions would work here, although they might occasionally capture duplicates. I actually wrote a proof-of concept python script, and it worked wonderfully in extracting the links off of the CNN main page. For certain HTML parsing tasks, REs do work. – Nathan Jun 02 '13 at 22:19
  • After looking into it a little more, it seems like it may be possible for HTML links to contain double quotes within themselves. If this is the case, then you'd need to alter the pattern you're searching for to take that into account. – Nathan Jun 02 '13 at 22:25
  • personally, I think I would want to look for the " followed by a space or something. There typically should be something like that following a link. Or a > or a , either way these would be searchable. Still the above program runs pretty slick. It is specific to Visual Studio though which bugs me. – DonkeyKong Jun 02 '13 at 22:31
  • Here's part of a python script I used to test it. This doesn't allow for double quotes in the link, but that could be fixed: results = re.findall('href="http[^"]*"', html_source) In this, [^"]* means zero or more matches to all characters except double quotes. This leaves href= and quotes in the end result, but that would be easy enough to remove. – Nathan Jun 02 '13 at 22:33
  • Extended discussions in comments are discouraged on Stackoverflow and not appropriate. I am more than happy to discuss this issue with you and have created a [chat room](http://chat.stackoverflow.com/rooms/31080/the-captains-galley) to specifically address it. – Captain Obvlious Jun 02 '13 at 22:41