I'm trying to create a program using Excel 2013 and Visual Basic. As part of that program, I'm going to have a list of TV shows in a list box. I'd like to be able to double-click one of them and have it open another form with a list box containing all the seasons, and the episodes in each season, for that show.
The best way I've found of doing that would be to scrape Wikipedia.org. I think it'll be one of the only websites that has this information in roughly the same format for every show. I also plan to do this with books.
I originally read about scraping at this website: http://www.wiseowl.co.uk/blog/s393/scrape-website-html.htm
However, I'd never done anything with the getElement(s)By* methods, so I wasn't sure how they worked. Any help there would be appreciated. After scouring the internet, the following is the best code I could piece together:
Private Sub cmdTest_Click()
    'to refer to the running copy of Internet Explorer
    Dim ie As InternetExplorer
    'to refer to the HTML document returned
    Dim html As HTMLDocument
    'for iteration
    Dim i As Integer
    Dim j As Integer
    'open Internet Explorer in memory, and go to website
    Set ie = New InternetExplorer
    ie.Visible = False
    ie.navigate "http://en.wikipedia.org/wiki/List_of_Archer_episodes"
    'ie.navigate "http://en.wikipedia.org/wiki/List_of_The_Simpsons_episodes"
    'Wait until IE is done loading page
    Do While ie.READYSTATE <> READYSTATE_COMPLETE
        Application.StatusBar = "Trying to go to Episodes ..."
        DoEvents
    Loop
    'show text of HTML document returned
    Set html = ie.document
    'close down IE and reset status bar
    Set ie = Nothing
    Application.StatusBar = ""
    'clear old data out and put titles in
    Cells.Clear
    'put heading across the top of row 3
    Range("A3").Value = "Season"
    Range("B3").Value = "Episode"
    i = 4
    For Each ele In html.getElementsByClassName("summary")
        Sheets("Wiki2").Range("B" & i).Value = ele.innerText
        i = i + 1
    Next
    i = 4
    For Each ele In html.getElementsByClassName("mw-headline")
        Sheets("Wiki2").Range("A" & i).Value = Left(ele.innerText, 8)
        i = i + 1
    Next
End Sub
The first part seems to be a pretty generic way to get the source code for a given page. I currently have it set to pull the episodes of the TV show "Archer". The following code:
i = 4
For Each ele In html.getElementsByClassName("summary")
    Sheets("Wiki2").Range("B" & i).Value = ele.innerText
    i = i + 1
Next
i = 4
For Each ele In html.getElementsByClassName("mw-headline")
    Sheets("Wiki2").Range("A" & i).Value = Left(ele.innerText, 8)
    i = i + 1
Next
is what I'm using to pull the text I'm looking for. What I need help with is combining these two loops. I need it to loop through and find each instance of "mw-headline" and, every time it finds one, look for the class name "summary". If it finds "summary", it should put the headline's (aka season's) inner text in a cell in column A and the summary (aka episode name) in the adjacent cell in column B.
These currently work independently. If you run this code, you get all the episodes of the TV show starting in cell B4, and you get a list of everything with the "mw-headline" class starting in cell A4. The problem is that "mw-headline" is applied not only to the seasons but to a couple of other things as well, hence the need to check whether there is a "summary" class under it. This will also get rid of the cases where the page has a heading for a new season but the text under it only says that the season is forthcoming. Without a "summary" element under it, the heading shouldn't be listed. I'd like the season to be displayed in column A beside every episode in column B, so if there are 10 episodes in each season, column A would have 10 instances of "Season 1", then 10 instances of "Season 2", and so on.
Thanks for the help. For anyone who doesn't know or who stumbles on this question in the future, you need to put the following code at the top of your code window:
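One way the combined loop described above could be sketched is to walk every element in document order, remember the most recent "mw-headline" text, and write a row only when a "summary" element turns up. This is an untested sketch, not a confirmed solution: WriteSeasonEpisodes and HasClass are hypothetical helper names, and it assumes each season heading appears in the page before its episode table.

```vba
' Hypothetical helper: True if the element's class attribute contains cls.
Private Function HasClass(ByVal ele As Object, ByVal cls As String) As Boolean
    Dim names As String
    On Error Resume Next          ' some nodes may not expose a string className
    names = ele.className
    On Error GoTo 0
    HasClass = InStr(1, " " & names & " ", " " & cls & " ", vbTextCompare) > 0
End Function

' Hypothetical routine: pairs each episode with the last heading seen before it.
Private Sub WriteSeasonEpisodes(ByVal html As HTMLDocument, ByVal ws As Worksheet)
    Dim ele As Object
    Dim season As String
    Dim r As Long

    r = 4                         ' data starts below the headings in row 3
    For Each ele In html.all      ' html.all lists elements in source order
        If HasClass(ele, "mw-headline") Then
            ' remember the heading; nothing is written yet
            season = Left(ele.innerText, 8)
        ElseIf HasClass(ele, "summary") Then
            ' one row per episode, tagged with the current heading
            ws.Range("A" & r).Value = season
            ws.Range("B" & r).Value = ele.innerText
            r = r + 1
        End If
    Next ele
End Sub
```

It could be called from cmdTest_Click as WriteSeasonEpisodes html, Sheets("Wiki2"). Because rows are written only for "summary" elements, a heading with no episodes under it never produces a row.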
Enum READYSTATE
    READYSTATE_UNINITIALIZED = 0
    READYSTATE_LOADING = 1
    READYSTATE_LOADED = 2
    READYSTATE_INTERACTIVE = 3
    READYSTATE_COMPLETE = 4
End Enum
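The declarations As InternetExplorer and As HTMLDocument also need references to "Microsoft Internet Controls" and "Microsoft HTML Object Library" (Tools > References in the VBA editor). As a rough alternative, here is a late-bound sketch of the page-loading step that needs neither the references nor the Enum. LoadDocument and READYSTATE_DONE are hypothetical names, and the sketch is untested.

```vba
' Hypothetical late-bound loader: returns the loaded document for a URL.
Private Function LoadDocument(ByVal url As String) As Object
    Const READYSTATE_DONE As Long = 4   ' same value as READYSTATE_COMPLETE
    Dim ie As Object

    ' Late binding: no library reference is needed for CreateObject
    Set ie = CreateObject("InternetExplorer.Application")
    ie.Visible = False
    ie.navigate url

    ' Wait until IE reports that the page has finished loading
    Do While ie.Busy Or ie.readyState <> READYSTATE_DONE
        DoEvents
    Loop

    Set LoadDocument = ie.document
    Set ie = Nothing   ' as in the code above, IE itself is not told to quit
End Function
```

Usage would be along the lines of Set html = LoadDocument("http://en.wikipedia.org/wiki/List_of_Archer_episodes"), with html declared As Object.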
PS - In the code "For Each ele In *", is ele an undefined variable, or is it a word VBA recognizes that stands for "element"? I got this from a copy-and-paste job and I don't understand it. Thanks.
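On the PS: ele is not a reserved word in VBA. In the code above it is simply a variable that was never declared, so without Option Explicit VBA creates it implicitly as a Variant. A minimal sketch of declaring it explicitly follows; ListSummaries is a hypothetical name, and html is assumed to be an already loaded document.

```vba
Option Explicit   ' forces every variable, including ele, to be declared

' Hypothetical routine that prints the text of each "summary" element.
Private Sub ListSummaries(ByVal html As HTMLDocument)
    Dim ele As Object   ' loop variable; For Each needs an Object or Variant

    For Each ele In html.getElementsByClassName("summary")
        Debug.Print ele.innerText   ' output goes to the Immediate window
    Next ele
End Sub
```

With Option Explicit at the top of the module, a missing declaration like this one becomes a compile error instead of a silent Variant.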