Scraping an HTML Table
To scrape data from a webpage, you use code to open an internet explorer window, navigate to a page, and possibly interact with different elements on the page (such as clicking links, filling in textboxes, scrolling, etc). In this tutorial we will be scraping data from NBA.com’s statistics page. I feel this is a good place to start because what we are going to do is pretty straight forward, covers a few different concepts, and will not put any strain on the website because the information we will pull is contained on a single page.
Here is a snippet of code to do this:
Sub scrapeData() dim ie as Object dim url as String ' Set url to the page you want to scrape url = "http://stats.nba.com/leaders" ' Create your Internet Explorer object Set ie = CreateObject("InternetExplorer.Application") ' Make it visible ie.Visible = True ' Navigate to the webpage ie.navigate url ' Wait while the page is loading While ie.Busy DoEvents Wend
Before we proceed, there are a few things to cover. The Internet Explorer object has a number of built in attributes and methods that we will be taking advantage of. Almost everything we use will be within the document attribute which can be accessed via ie.document. The following methods will be used frequently:
.getElementsByClassName()
This returns a collection of items from the html page that have the specified class name
.getElementsByName()
This returns a collection of items from the html page that have the specified name
.getElementsByTagName()
This returns a collection of items from the html page that have the specified tag
When ever you have a collection of elements, you can refer to a single element using parenthesis, or loop through each element using a for each
loop.
.getElementByid()
This returns a single item that has the specified id
.innerhtml
This returns the actual html code contained in the document
.outerhtml
Very similar to innerhtml as it returns the html, but it also includes the opening and closing tags
.innertext
Returns the actual text being displayed, leaving out the html tags
.click
Can be used to click buttons on the webpage
.value
Can be used to retrieve the value of an element, but also to set the value of an element. For instance, you can fill in textboxes to submit forms
The data we want to scrape is the statistics for each player. It appears to be contained in a table, and the table is scrollable by clicking the arrows in the bottom right. What we want to do is scrape the first 50 players, click the over button, scrape the next 50, and so on until we have scraped all of the players.
One nice thing about modern web browsers is they all have built-in developer functions that will help us figure out how to access the page elements that we will need. The screenshots included in this tutorial will be from Safari, but Chrome and Firefox offer very similar interfaces. Typically you can right click on a portion of the webpage and select ‘Inspect’ and a window will open that shows the source code of that portion of the website.
In our case, right click on the first number in the table we want to scrape and select inspect. Here is what we see:We can see that the data we want is in an html table and the table is in the <div class=”nba-stat-table__overflow”> tag. We can extract just this <div> element by using the getElementsByClassName() function. From there we can work on looping through the rows and columns of the table using the getElementsByTagName() function.
Set Table = ie.document.getelementsbyclassname("nba-stat-table__overflow") Set tRows = Table(0).getelementsbytagname("tr") ' Create variables to track the row and column to output the ' text on our spreadsheet rNum = 1 cNum = 1 ' First we can get the column headings which use the "th" tag Set tHead = Table(0).getelementsbytagname("th") ' Loop through each column heading For Each h In tHead ' Output the contents of the cell to the spreadsheet Sheet1.Cells(rNum, cNum).Value = h.innertext ' Increase the cNum value so the next time around the 'data will output to the column to the right cNum = cNum + 1 Next ' Move on to the next row before pulling the data and reset ' the column back to 1 rNum = rNum + 1 cNum = 1 ' Loop through each row in the table For Each r In tRows ' Within each row, pull each cell by using the ' getelementsbytagname method and use the table tag "td" Set tcells = r.getelementsbytagname("td") ' Loop through each cell of the row For Each c In tcells ' Output the contents of the cell to the ' spreadsheet Sheet1.Cells(rNum, cNum).Value = c.innertext ' Increase the cNum value so the next time around ' the data will output to the column to the right cNum = cNum + 1 Next ' When we switch to the next row of the table, ' increase the rNum value so we go to the next ' row of our spreadsheet, and also reset back to ' column number 1 rNum = rNum + 1 cNum = 1 Next
This code will loop through each row of the table and then loop through each cell of each row and output the contents to the spreadsheet. You may notice that there are actually six ‘pages’ of this table on the site. If you want to loop through each page, then you will need to create a for loop that runs six times. This will require placing your initial counters outside of the loop (so they do not get reset to one each time) and it may be worth scraping how many pages there are on the table so if that happens to change, your code adjusts accordingly. To figure out how many pages there are, you can use the following code:
' If you inspect the source code you will see that the ' information containing how many pages there are is ' contained the 'stats-table-pagination__info' tag so ' we will pull that Set temp = ie.document.getelementsbyclassname("stats-table- pagination__info") ' Since temp is a collection of elements, to access our item, ' we want to get the first item in the collection, so we use ' the (0) to indicate we want the innertext of the first item. numPages = temp(0).innertext ' We need to parse the text a little to get the actual number ' we are looking for. So we will figure out where the string ' 'of' is, then take everything after that. We could just say ' give us the last character, but our method stays felxible in ' the event that there are blank spaces, or the number we are ' trying to get is double digits pos = InStr(numPages, "of") np = Mid(numPages, pos + 3, Len(numPages) - pos) ' Trim away any extra white space to just get our number which ' is now stored in the variable np np = Trim(np)
When we put this all together, we get the following code. You should be able to copy and paste this into a module and run it to pull all of the data from the table we want:
Sub getNBAdata() 'Declare all of our variables Dim ie As Object Dim btn As Object Dim tRows As Object Dim temp As Object Dim Table As Object Dim tHead As Object Dim tCells As Object Dim np As Variant Dim numPages As String Dim url As String Dim pos As Integer Dim rNum As Integer Dim cNum As Integer ' Set url to the page you want to scrape url = "http://stats.nba.com/leaders" ' Create your Internet Explorer object Set ie = CreateObject("InternetExplorer.Application") ' Make it visible ie.Visible = True ' Navigate to the webpage ie.navigate url ' Wait while the page is loading While ie.Busy DoEvents Wend ' Wait an additional 3 seconds for good measure Application.Wait DateAdd("s", 3, Now) ' The following set of code will determine how many pages to ' iterate through. When you look at the page you see that ' there are 6 pages. Because this might change, I don't want ' to hardcode the 6, so I am going to scrape that information ' from the page. Looking at the code, the information I want ' is in a section called stats-table-pagination__info, so I ' will use the getelementsbyclassname method to pull that ' section. Set temp = ie.document.getelementsbyclassname("stats-table-pagination__info") ' Since temp is a collection of elements, to access our item, ' we want to get the first item in the collection, so we use ' the (0) to indicate we want the innertext of the first item. numPages = temp(0).innertext ' We need to parse the text a little to get the actual number we ' are looking for. So we will figure out where the string 'of' ' is, then take everything after that. We could just say give us ' the last character, but this method stays felxible in the ' event that there are blank spaces, or the number we are trying ' to get is double digits pos = InStr(numPages, "of") np = Mid(numPages, pos + 3, Len(numPages) - pos) ' Trim away any extra white space to just get our number. np = Trim(np) ' Create variables to track the row and column to output the ' text on our spreadsheet rNum = 1 cNum = 1 ' This for loop repeats this same process for every page For i = 1 To np ' Looking at the source code we see that the table we want to ' scrape data from is in class called "nba-stat-table__overflow" ' so we can use the getelementsbyclass name method to get that Set Table = ie.document.getelementsbyclassname("nba-stat-table__overflow") ' Again, Table is now a collection, so to refer to the first ' item in the collection, we use (0) We want to pull each row in ' that section, so we can use the getelementsbytagname method ' and use the table row tag of "tr" Set tRows = Table(0).getelementsbytagname("tr") ' First we can get the column headings which use the "th" tag Set tHead = Table(0).getelementsbytagname("th") ' Loop through each column heading For Each h In tHead ' Output the contents of the cell to the spreadsheet Sheet1.Cells(rNum, cNum).Value = h.innertext ' Increase the cNum value so the next time around the data will ' output to the column to the right cNum = cNum + 1 Next ' Move on to the next row before pulling the data and reset the ' column back to 1 rNum = rNum + 1 cNum = 1 ' Loop through each row in the table For Each r In tRows ' Within each row, pull each cell by using the ' getelementsbytagname method and use the table tag "td" Set tCells = r.getelementsbytagname("td") ' Loop through each cell of the row For Each c In tCells ' Output the contents of the cell to the spreadsheet Sheet1.Cells(rNum, cNum).Value = c.innertext ' Increase the cNum value so the next time around the data will ' output to the column to the right cNum = cNum + 1 Next ' When we switch to the next row of the table, increase the rNum ' value so we go to the next row of our spreadsheet, and also ' reset back to column number 1 rNum = rNum + 1 cNum = 1 Next ' After we have pulled all the data from the table, click the ' button on the page to go to the next page of the table before ' starting over again Set btn = ie.document.getelementsbyclassname("stats-table-pagination__next") btn(0).Click ' Wait while the page updates While ie.Busy DoEvents Wend ' If this while loop isn't working and the code is proceeding ' before the table has updated to the new page, then you can ' add another two second wait to the code after the loop ' Since this will run 6 times, that will add 12 seconds to the ' running time of the code. Just uncomment the line below to ' add the extra wait time. 'Application.Wait DateAdd("s", 2, Now) Next ' Quit the internet explorer application ie.Quit ' Clear the ie object. This probably isn't necessary, but helps ' clean things up Set ie = Nothing End Sub
You can obtain a copy of the Excel file with this code here.
I am getting an error as “Object variable or With block variable not set”
LikeLike
I’m getting the same error of “Object variable or With block variable not set”
LikeLike
Did you try what was suggested in my April 8th reply? Did you try pausing for 5 or even 10 seconds (not practical for any sort of mass text scraping) to ensure the page loaded? Because most likely what is happening is the code is continuing before the page has fully loaded.
LikeLike
numPages = temp(0).innertext . please help me to fix the error- “object variable or with block variable not set ” for above code
LikeLike
I believe what is happening is the “While ie.Busy DoEvents Wend” loop is failing us. The point of that loop is to pause and wait for the page to finish loading before proceeding. I find the method to be very finicky and not super reliable, but it remains what most people suggest to do to wait for a page load. So basically, the page is saying it is done loading, so our code proceeds with setting the temp variable to: “ie.document.getelementsbyclassname(“stats-table-pagination__info”)” and that element hasn’t loaded yet. So when we go to retreive the first item from temp, it is saying there is nothing there (i.e. object variable not set). So the solution? Well, I am still trying to figure out the most full-proof way to wait for a page to load without waiting any longer than absolutely necessary, but a simple hack you can do is add this line right after the While loop ends:
Application.Wait DateAdd(“s”, 3, Now)
This will make the code pause for another 3 seconds (which should be enough time for the page to completely load). If the error still pops up, you could change the 3 to a 5 to make it wait 5 seconds which may be more appropriate for a slower internet connection.
I’ve updated the post to include the Application.Wait DateAdd(“s”, 3, Now) line to try and prevent this from happening to other people going forward.
Hopefully that fixes things for you.
LikeLike
Thanks for your code, helped me a lot !!!
But in my opinion you forgot some things:
1) Define all variables
Dim i As Variant
Dim h As Object
Dim c As Object
Dim r As Object
2) It’s not Sheet1 but Sheets(1)
This little changes worked for me.
Thanks again!
LikeLike
First of all I would like to say great blog! I had a quick question in which I’d like to ask
if you do not mind. I was interested to know how you center yourself and clear your thoughts prior
to writing. I have had a tough time clearing my
thoughts in getting my thoughts out. I do take pleasure
in writing however it just seems like the first 10 to
15 minutes are usually wasted just trying to figure
out how to begin. Any suggestions or tips? Thanks!
LikeLike
Hi! is there any way to modify the code to get the table from this web? Thanks a lot
https://markets.cboe.com/us/equities/market_statistics/book/AAPL/
LikeLike