Scrape Website Data into Excel using VBA

I'll be showing you an example on how to Scrape Data from a Website into Excel Worksheet using VBA. We'll be scraping data from www(...

I'll be showing you an example on how to Scrape Data from a Website into Excel Worksheet using VBA. We'll be scraping data from www(dot)renewableuk(dot)com. Please also read the privacy policy of the website before mining data.


Goal:
Get all data under all column headings which can be found on this website i.e.
Wind Project, Region, ..., Type of Project

Requirements:
You need to add a reference, Microsoft HTML Object Library on your VBA project.

Usage:
You can call the ProcessWeb() sub directly by pressing F5 on the Microsoft Visual Basic Window.
Or you can add a button on your excel worksheet then assign ProcessWeb() as the macro.

VBA CODE:

1
Function ScrapeWebPage(ByVal URL As String)<br>    Dim HTMLDoc As New HTMLDocument<br>    Dim tmpDoc As New HTMLDocument<br>    <br>    Dim WS As Worksheet<br>    <br>    Dim i As Integer, row As Integer<br>    Dim File As Integer<br>    Dim Filename As String<br>    Dim DataLine As String<br>    File = FreeFile<br>    <br>    Filename = ActiveWorkbook.Path & "\html.log"<br>    <br>    Set WS = Sheets("DATA")<br>    <br>    'create new XMLHTTP Object<br>    Set XMLHttpRequest = CreateObject("MSXML2.XMLHTTP")<br>    XMLHttpRequest.Open "GET", URL, False<br>    XMLHttpRequest.send<br>    <br>    While XMLHttpRequest.readyState <> 4<br>        DoEvents<br>    Wend<br><br>    With HTMLDoc.body<br>        'Set HTML Document<br>        .innerHTML = XMLHttpRequest.responseText<br>        <br>        'Get only Order List Tag of HTML Document<br>Set orderedlists = .getElementsByTagName("ol")<br>        <br>        'Reset the Document to the HTML of the second ordered list element<br>        'where we only need to extract the data<br>        .innerHTML = orderedlists(1).innerHTML<br>    <br>        'Now, we'll get the list items<br>    Set ListItems = .getElementsByTagName("li")<br>    <br>    'Open our log file for output stream<br>    Open Filename For Output As #File<br>    For Each li In ListItems<br>        <br>        With tmpDoc.body<br>            'Set the temp doc<br>            .innerHTML = li.innerHTML<br>            <br>            'There are about 10 columns, so there are 10 p's<br>            Set ps = .getElementsByTagName("p")<br>            <br>            For Each p In ps<br>                'Print only the text, excluding the tags<br>                Print #File, p.innerText<br>            Next<br>            <br>        End With<br>    Next<br>    'close the file<br>    Close #File<br>    <br>    End With<br>    <br>    'Open the file again, we'll use it to retrieve each data lines<br>    Open Filename For Input As #File<br>    <br>    'Last row of the worksheet<br>    row = lastRow + 1<br>    <br>    While Not EOF(File)<br>        For i = 1 To 10<br>            'read the data from the log file<br>            Line Input #File, DataLine<br>            <br>            'Put the data on the 1st to 10th column<br>            WS.Cells(row, i).Value = DataLine<br>            <br>        Next i<br>        row = row + 1<br>    Wend<br>    Close #File<br>    <br>End Function<br><br>'Get the total number pages we need to scrape<br>Function totalPage() As Integer<br>    Dim HTMLDoc As New HTMLDocument<br>    Dim tmpDoc As New HTMLDocument<br>    Dim html As String<br>    Dim mask As String<br>    Dim URL As String<br>    <br>    URL = "http://www.renewableuk.com/en/renewable-energy/wind-energy/uk-wind-energy-database/index.cfm"<br>        <br>    Set XMLHttpRequest = CreateObject("MSXML2.XMLHTTP")<br>    XMLHttpRequest.Open "GET", URL, False<br>    XMLHttpRequest.send<br>    <br>    html = XMLHttpRequest.responseText<br>    <br>    With HTMLDoc.body<br>        .innerHTML = Mid(html, InStr(1, html, "<span chr="" class=" & Chr(34) & " page-dotted="">"), 300)<br>        mask = Mid(.innerHTML, InStr(1, LCase(.innerHTML), "</span>") - 2, 2)<br>    End With<br>    <br>    totalPage = mask<br>    <br>End Function<br><br>Function lastRow() As Long<br>    lastRow = Range("A65536").End(xlUp).row<br>End Function<br><br>Sub ProcessWeb()<br>    Dim URL As String<br>    Dim i As Integer<br>    <br>    Range("2:2", Selection.End(xlDown)).ClearContents<br>    Range("A2").Select<br>    <br>    Application.ScreenUpdating = False<br>    Application.Cursor = xlWait<br>    <br>    URL = "http://www.renewableuk.com/en/renewable-energy/wind-energy/uk-wind-energy-database/index.cfm/page/"<br>        <br>    For i = 1 To totalPage<br>        ScrapeWebPage URL & i<br>        Application.StatusBar = "Please wait while processing page " & i & " of " & totalPage & "..."<br>    Next i<br>    <br>    Application.ScreenUpdating = True<br>    Application.Cursor = xlDefault<br>    Application.StatusBar = ""<br>    <br>    MsgBox "Data Extraction is Done!"<br>    <br>End Sub<br>

Post a Comment

emo-but-icon
:noprob:
:smile:
:shy:
:trope:
:sneered:
:happy:
:escort:
:rapt:
:love:
:heart:
:angry:
:hate:
:sad:
:sigh:
:disappointed:
:cry:
:fear:
:surprise:
:unbelieve:
:shit:
:like:
:dislike:
:clap:
:cuff:
:fist:
:ok:
:file:
:link:
:place:
:contact:

Home item

Popular Posts