Scraping HTML Table with VBA

Scraping an HTML Table

To scrape data from a webpage, you use code to open an internet explorer window, navigate to a page, and possibly interact with different elements on the page (such as clicking links, filling in textboxes, scrolling, etc). In this tutorial we will be scraping data from NBA.com’s statistics page. I feel this is a good place to start because what we are going to do is pretty straight forward, covers a few different concepts, and will not put any strain on the website because the information we will pull is contained on a single page.

Here is a snippet of code to do this:

Sub scrapeData()
     dim ie as Object
     dim url as String

     ' Set url to the page you want to scrape
     url = "http://stats.nba.com/leaders"

     ' Create your Internet Explorer object
     Set ie = CreateObject("InternetExplorer.Application")

     ' Make it visible
     ie.Visible = True
     ' Navigate to the webpage
     ie.navigate url

     ' Wait while the page is loading
     While ie.Busy 
          DoEvents 
     Wend

Before we proceed, there are a few things to cover. The Internet Explorer object has a number of built in attributes and methods that we will be taking advantage of. Almost everything we use will be within the document attribute which can be accessed via ie.document. The following methods will be used frequently:

.getElementsByClassName() This returns a collection of items from the html page that have the specified class name
.getElementsByName() This returns a collection of items from the html page that have the specified name
.getElementsByTagName() This returns a collection of items from the html page that have the specified tag

When ever you have a collection of elements, you can refer to a single element using parenthesis, or loop through each element using a for each loop.

.getElementByid() This returns a single item that has the specified id
.innerhtml This returns the actual html code contained in the document
.outerhtml Very similar to innerhtml as it returns the html, but it also includes the opening and closing tags
.innertext Returns the actual text being displayed, leaving out the html tags
.click Can be used to click buttons on the webpage
.value Can be used to retrieve the value of an element, but also to set the value of an element. For instance, you can fill in textboxes to submit forms

The data we want to scrape is the statistics for each player. It appears to be contained in a table, and the table is scrollable by clicking the arrows in the bottom right.  What we want to do is scrape the first 50 players, click the over button, scrape the next 50, and so on until we have scraped all of the players.

One nice thing about modern web browsers is they all have built-in developer functions that will help us figure out how to access the page elements that we will need. The screenshots included in this tutorial will be from Safari, but Chrome and Firefox offer very similar interfaces. Typically you can right click on a portion of the webpage and select ‘Inspect’ and a window will open that shows the source code of that portion of the website.

In our case, right click on the first number in the table we want to scrape and select inspect. Here is what we see:Screen Shot 2017-09-14 at 7.31.46 PMWe can see that the data we want is in an html table and the table is in the <div class=”nba-stat-table__overflow”> tag. We can extract just this <div> element by using the getElementsByClassName() function. From there we can work on looping through the rows and columns of the table using the getElementsByTagName() function.

     Set Table = ie.document.getelementsbyclassname("nba-stat-table__overflow")
     Set tRows = Table(0).getelementsbytagname("tr")
        
     ' Create variables to track the row and column to output the
     ' text on our spreadsheet
     rNum = 1
     cNum = 1

     ' First we can get the column headings which use the "th" tag
     Set tHead = Table(0).getelementsbytagname("th")
        
     ' Loop through each column heading
     For Each h In tHead
          ' Output the contents of the cell to the spreadsheet
          Sheet1.Cells(rNum, cNum).Value = h.innertext
          ' Increase the cNum value so the next time around the 
          'data will output to the column to the right
          cNum = cNum + 1
     Next
        
     ' Move on to the next row before pulling the data and reset 
     ' the column back to 1
     rNum = rNum + 1
     cNum = 1
        
     ' Loop through each row in the table
     For Each r In tRows
            
          ' Within each row, pull each cell by using the 
          ' getelementsbytagname method and use the table tag "td"
          Set tcells = r.getelementsbytagname("td")
            
            ' Loop through each cell of the row
            For Each c In tcells
                ' Output the contents of the cell to the 
                ' spreadsheet
                Sheet1.Cells(rNum, cNum).Value = c.innertext
                ' Increase the cNum value so the next time around 
                ' the data will output to the column to the right
                cNum = cNum + 1
            Next
            
            ' When we switch to the next row of the table, 
            ' increase the rNum value so we go to the next 
            ' row of our spreadsheet, and also reset back to 
            ' column number 1
            rNum = rNum + 1
            cNum = 1
     Next

This code will loop through each row of the table and then loop through each cell of each row and output the contents to the spreadsheet. You may notice that there are actually six ‘pages’ of this table on the site. If you want to loop through each page, then you will need to create a for loop that runs six times. This will require placing your initial counters outside of the loop (so they do not get reset to one each time) and it may be worth scraping how many pages there are on the table so if that happens to change, your code adjusts accordingly. To figure out how many pages there are, you can use the following code:

     ' If you inspect the source code you will see that the 
     ' information containing how many pages there are is 
     ' contained the 'stats-table-pagination__info' tag so 
     ' we will pull that
     Set temp = ie.document.getelementsbyclassname("stats-table-    pagination__info")
    
    ' Since temp is a collection of elements, to access our item, 
    ' we want to get the first item in the collection, so we use 
    ' the (0) to indicate we want the innertext of the first item.
    numPages = temp(0).innertext
    
    ' We need to parse the text a little to get the actual number 
    ' we are looking for. So we will figure out where the string 
    ' 'of' is, then take everything after that. We could just say 
    ' give us the last character, but our method stays felxible in
    ' the event that there are blank spaces, or the number we are 
    ' trying to get is double digits
    
    pos = InStr(numPages, "of")
    np = Mid(numPages, pos + 3, Len(numPages) - pos)
    
    ' Trim away any extra white space to just get our number which
    ' is now stored in the variable np
    np = Trim(np)

When we put this all together, we get the following code. You should be able to copy and paste this into a module and run it to pull all of the data from the table we want:

Sub getNBAdata() 
     'Declare all of our variables 
     Dim ie As Object 
     Dim btn As Object 
     Dim tRows As Object 
     Dim temp As Object 
     Dim Table As Object 
     Dim tHead As Object 
     Dim tCells As Object 
     Dim np As Variant 
     Dim numPages As String 
     Dim url As String 
     Dim pos As Integer 
     Dim rNum As Integer 
     Dim cNum As Integer 

     ' Set url to the page you want to scrape 
     url = "http://stats.nba.com/leaders" 

     ' Create your Internet Explorer object 
     Set ie = CreateObject("InternetExplorer.Application") 

     ' Make it visible 
     ie.Visible = True 

     ' Navigate to the webpage 
     ie.navigate url 

     ' Wait while the page is loading 
     While ie.Busy 
          DoEvents 
     Wend 

     ' Wait an additional 3 seconds for good measure 
     Application.Wait DateAdd("s", 3, Now)

     ' The following set of code will determine how many pages to 
     ' iterate through. When you look at the page you see that 
     ' there are 6 pages. Because this might change, I don't want 
     ' to hardcode the 6, so I am going to scrape that information 
     ' from the page. Looking at the code, the information I want 
     ' is in a section called stats-table-pagination__info, so I 
     ' will use the getelementsbyclassname method to pull that 
     ' section. 

     Set temp = ie.document.getelementsbyclassname("stats-table-pagination__info") 

     ' Since temp is a collection of elements, to access our item, 
     ' we want to get the first item in the collection, so we use 
     ' the (0) to indicate we want the innertext of the first item. 
     numPages = temp(0).innertext 

     ' We need to parse the text a little to get the actual number we
     ' are looking for. So we will figure out where the string 'of' 
     ' is, then take everything after that. We could just say give us
     ' the last character, but this method stays felxible in the 
     ' event that there are blank spaces, or the number we are trying
     ' to get is double digits 
     pos = InStr(numPages, "of") 
     np = Mid(numPages, pos + 3, Len(numPages) - pos) 

     ' Trim away any extra white space to just get our number. 
     np = Trim(np) 
     
     ' Create variables to track the row and column to output the 
     ' text on our spreadsheet 
     rNum = 1 
     cNum = 1 

     ' This for loop repeats this same process for every page 
     For i = 1 To np 

     ' Looking at the source code we see that the table we want to 
     ' scrape data from is in class called "nba-stat-table__overflow"
     ' so we can use the getelementsbyclass name method to get that 
          Set Table = ie.document.getelementsbyclassname("nba-stat-table__overflow") 

     ' Again, Table is now a collection, so to refer to the first 
     ' item in the collection, we use (0) We want to pull each row in
     ' that section, so we can use the getelementsbytagname method 
     ' and use the table row tag of "tr" 
          Set tRows = Table(0).getelementsbytagname("tr") 

     ' First we can get the column headings which use the "th" tag 
          Set tHead = Table(0).getelementsbytagname("th") 

     ' Loop through each column heading 
          For Each h In tHead
 
     ' Output the contents of the cell to the spreadsheet 
               Sheet1.Cells(rNum, cNum).Value = h.innertext 

     ' Increase the cNum value so the next time around the data will
     ' output to the column to the right 
               cNum = cNum + 1 
          Next 
     ' Move on to the next row before pulling the data and reset the
     ' column back to 1 
          rNum = rNum + 1 
          cNum = 1 

     ' Loop through each row in the table 
          For Each r In tRows 
     ' Within each row, pull each cell by using the 
     ' getelementsbytagname method and use the table tag "td" 
               Set tCells = r.getelementsbytagname("td") 

     ' Loop through each cell of the row 
               For Each c In tCells 
     ' Output the contents of the cell to the spreadsheet 
                    Sheet1.Cells(rNum, cNum).Value = c.innertext 

     ' Increase the cNum value so the next time around the data will
     ' output to the column to the right 
                    cNum = cNum + 1 
               Next 

     ' When we switch to the next row of the table, increase the rNum     ' value so we go to the next row of our spreadsheet, and also 
     ' reset back to column number 1 
               rNum = rNum + 1 
               cNum = 1 
          Next 

     ' After we have pulled all the data from the table, click the 
     ' button on the page to go to the next page of the table before      ' starting over again 
          Set btn = ie.document.getelementsbyclassname("stats-table-pagination__next") 
          btn(0).Click 

     ' Wait while the page updates 
          While ie.Busy 
               DoEvents 
          Wend 
     ' If this while loop isn't working and the code is proceeding
     ' before the table has updated to the new page, then you can 
     ' add another two second wait to the code after the loop
     ' Since this will run 6 times, that will add 12 seconds to the
     ' running time of the code. Just uncomment the line below to 
     ' add the extra wait time.
          'Application.Wait DateAdd("s", 2, Now)
     Next 
     
     ' Quit the internet explorer application 
     ie.Quit 

     ' Clear the ie object. This probably isn't necessary, but helps 
     ' clean things up 
     Set ie = Nothing 

End Sub

You can obtain a copy of the Excel file with this code here.

 

Advertisements

6 thoughts on “Scraping HTML Table with VBA

Add yours

      1. Did you try what was suggested in my April 8th reply? Did you try pausing for 5 or even 10 seconds (not practical for any sort of mass text scraping) to ensure the page loaded? Because most likely what is happening is the code is continuing before the page has fully loaded.

        Like

  1. numPages = temp(0).innertext . please help me to fix the error- “object variable or with block variable not set ” for above code

    Like

    1. I believe what is happening is the “While ie.Busy DoEvents Wend” loop is failing us. The point of that loop is to pause and wait for the page to finish loading before proceeding. I find the method to be very finicky and not super reliable, but it remains what most people suggest to do to wait for a page load. So basically, the page is saying it is done loading, so our code proceeds with setting the temp variable to: “ie.document.getelementsbyclassname(“stats-table-pagination__info”)” and that element hasn’t loaded yet. So when we go to retreive the first item from temp, it is saying there is nothing there (i.e. object variable not set). So the solution? Well, I am still trying to figure out the most full-proof way to wait for a page to load without waiting any longer than absolutely necessary, but a simple hack you can do is add this line right after the While loop ends:
      Application.Wait DateAdd(“s”, 3, Now)

      This will make the code pause for another 3 seconds (which should be enough time for the page to completely load). If the error still pops up, you could change the 3 to a 5 to make it wait 5 seconds which may be more appropriate for a slower internet connection.

      I’ve updated the post to include the Application.Wait DateAdd(“s”, 3, Now) line to try and prevent this from happening to other people going forward.

      Hopefully that fixes things for you.

      Like

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out /  Change )

Google+ photo

You are commenting using your Google+ account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s

Powered by WordPress.com.

Up ↑

%d bloggers like this: