
Download a Web Page from the Internet

Problem

You want to download a web page from the Internet and work with the content as a plain string.

Solution

Use the DownloadString() method from the .NET Framework’s System.Net.WebClient class to download a web page or plain text file into a string.

PS >$source = "http://blogs.msdn.com/powershell/rss.xml"
PS >
PS >$wc = New-Object System.Net.WebClient
PS >$content = $wc.DownloadString($source)
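A download can fail, for instance when the host is unreachable, so a script that relies on it may want to handle that case. The following is a minimal sketch rather than part of the recipe: it assumes PowerShell 2.0 or later (for try/catch), and it points WebClient at a temporary local file through a file URI so that it runs without network access. The temporary file and its content are purely illustrative.

```powershell
## Create a small local file to stand in for a remote page (illustrative only)
$path = [System.IO.Path]::GetTempFileName()
[System.IO.File]::WriteAllText($path, "Hello from a local file")
$source = ([Uri] $path).AbsoluteUri

$wc = New-Object System.Net.WebClient
try
{
    $content = $wc.DownloadString($source)
}
catch [System.Net.WebException]
{
    ## Raised for failures such as an unreachable host or a missing resource
    Write-Warning "Could not download ${source}: $($_.Exception.Message)"
}
finally
{
    ## Release the client's resources and remove the temporary file
    $wc.Dispose()
    Remove-Item $path
}

$content
```

Replacing $source with an http:// address exercises the same code path against a real site.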

Discussion

Although web services are becoming increasingly popular, they are still far less common than web pages that display useful data. Because of this, retrieving data from services on the Internet often comes by means of screen scraping: downloading the HTML of the web page and then carefully separating out the content you want from the vast majority of the content that you do not.
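To make the idea concrete, here is a minimal sketch of screen scraping with regular expressions. The HTML is an inline sample standing in for the result of DownloadString(), and the element names, class name, and variable names are all hypothetical.

```powershell
## Sample HTML standing in for a downloaded page (hypothetical content)
$html = @'
<html>
<head><title>Weather for Seattle</title></head>
<body><div class="temp">54 F</div></body>
</html>
'@

## Pull out the text between the <title> tags
if($html -match '<title>(.*?)</title>')
{
    $title = $matches[1]
}

## Pull out the temperature; this breaks if the class name ever changes
if($html -match '<div class="temp">(.*?)</div>')
{
    $temperature = $matches[1]
}

"{0}: {1}" -f $title, $temperature
```

Both patterns depend on the exact markup, which is why a small change to the page can silently break this kind of script.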

The technique of screen scraping has been around much longer than the Internet! As long as computer systems have generated output designed primarily for humans, screen scraping tools have risen to make this output available to other computer programs.

Unfortunately, screen scraping is an error-prone way to extract content. If the web page authors change the underlying HTML, your code will usually stop working correctly. If the site’s HTML is written as valid XHTML, you may be able to use PowerShell’s built-in XML support to more easily parse the content.
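When the markup is well-formed, the [xml] cast lets you navigate by element name instead of matching text patterns. The sketch below uses an inline sample in place of downloaded content; the document structure and variable names are assumptions made for illustration.

```powershell
## Sample well-formed XHTML standing in for downloaded content (hypothetical)
$content = @'
<html>
  <head><title>Team Blog</title></head>
  <body>
    <ul>
      <li>First post</li>
      <li>Second post</li>
    </ul>
  </body>
</html>
'@

## The [xml] cast throws if the markup is not well-formed
$xml = [xml] $content

## Navigate by element name instead of by text patterns
$pageTitle = $xml.html.head.title
$items = @($xml.html.body.ul.li)

$pageTitle
$items
```

Many real pages are not well-formed XML, so the cast itself is a quick test of whether this approach is available for a given site.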

Despite its fragility, pure screen scraping is often the only alternative. In Example 9-1, you use this approach to easily fetch Encarta “Instant Answers” from MSN Search. If the script no longer works when you run it, I apologize, although it does demonstrate the perils of screen scraping.

Example 9-1. Get-Answer.ps1

##############################################################################
##
## Get-Answer.ps1
##
## Use Encarta's Instant Answers to answer your question
##
## Example:
##    Get-Answer "What is the population of China?"
##############################################################################

param([string] $question = $( throw "Please ask a question."))

function Main
{
    ## Load the System.Web.HttpUtility DLL, to let us URLEncode
    [void] [System.Reflection.Assembly]::LoadWithPartialName("System.Web")

    ## Get the web page into a single string with newlines between
    ## the lines.
    $encoded = [System.Web.HttpUtility]::UrlEncode($question)
    $url = "http://search.live.com/results.aspx?q=$encoded"
    $text = (New-Object System.Net.WebClient).DownloadString($url)

    ## Get the answer with annotations
    $startIndex = $text.IndexOf('<span class="answer_header">')
    $endIndex = $text.IndexOf('function YNC')

    ## If we found a result, then filter the result
    if(($startIndex -ge 0) -and ($endIndex -ge 0))
    {
        $partialText = $text.Substring($startIndex, $endIndex - $startIndex)

        ## Very fragile screen scraping here
        $pattern = '<script.+?<div (id="results"|class="answer_fact_body")>'
        $partialText = $partialText -replace $pattern,"`n"
        $partialText = $partialText -replace '<span class="attr.?.?.?">',"`n"
        $partialText = $partialText -replace '<BR ?/>',"`n"

        $partialText = clean-html $partialText
        $partialText = $partialText -replace "`n`n", "`n"

        "`n" + $partialText.Trim()
    }
    else
    {
        "`nNo answer found."
    }
}

## Clean HTML from a text chunk
function clean-html ($htmlInput)
{
    $tempString = [Regex]::Replace($htmlInput, "<[^>]*>", "")
    $tempString.Replace("&nbsp&nbsp", "")
}

. Main
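The tag-stripping step in the script can be tried on its own, without any download. The following sketch mirrors that regular expression in a helper named Remove-HtmlTag, which is a hypothetical name chosen here; the sample markup is likewise made up.

```powershell
## Hypothetical helper that mirrors the script's tag-stripping step
function Remove-HtmlTag([string] $htmlInput)
{
    ## Replace anything that looks like <...> with nothing
    [Regex]::Replace($htmlInput, "<[^>]*>", "")
}

$sample = '<p>Hello, <b>world</b>!</p><br />'
$plain = Remove-HtmlTag $sample
$plain
```

Note that this removes only the tags themselves; character entities such as &amp; are left in the text and need separate handling.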
