Problem
You want to download a web page from the Internet and work with the content as a plain string.
Solution
Use the DownloadString() method from the .NET Framework’s System.Net.WebClient class to download a web page or plain text file into a string.
PS >$source = "http://blogs.msdn.com/powershell/rss.xml"
PS >$wc = New-Object System.Net.WebClient
PS >$content = $wc.DownloadString($source)
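Because DownloadString() returns an ordinary .NET string, you can work with the result immediately using PowerShell's standard string operators. For example (output omitted here, since it depends on the live feed):

PS >$content.Length
PS >$content -match "<rss"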
Discussion
Although web services are becoming increasingly popular, they are still far less common than web pages that display useful data. Because of this, retrieving data from services on the Internet often comes by means of screen scraping: downloading the HTML of the web page and then carefully separating out the content you want from the vast majority of the content that you do not.
The technique of screen scraping has been around much longer than the Internet! As long as computer systems have generated output designed primarily for humans, screen scraping tools have risen to make this output available to other computer programs.
Unfortunately, screen scraping is an error-prone way to extract content. If the web page authors change the underlying HTML, your code will usually stop working correctly. If the site's HTML is written as valid XHTML, you may be able to use PowerShell's built-in XML support to parse the content more easily.
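For example, the RSS feed from the Solution is valid XML, so you can cast the downloaded string to PowerShell's [xml] type and navigate it as a regular object. A minimal sketch, assuming the feed uses the standard rss/channel/item layout:

PS >$wc = New-Object System.Net.WebClient
PS >$feed = [xml] $wc.DownloadString("http://blogs.msdn.com/powershell/rss.xml")
PS >$feed.rss.channel.item | Select-Object title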
Despite its fragility, pure screen scraping is often the only alternative. In Example 9-1, you use this approach to easily fetch Encarta "Instant Answers" from MSN Search. If the script no longer works when you run it, I apologize, although it does demonstrate the perils of screen scraping.
Example 9-1. Get-Answer.ps1
##############################################################################
##
## Get-Answer.ps1
##
## Use Encarta's Instant Answers to answer your question
##
## Example:
##    Get-Answer "What is the population of China?"
##############################################################################

param([string] $question = $(throw "Please ask a question."))
function Main
{
    ## Load the System.Web.HttpUtility DLL, to let us URLEncode
    [void] [System.Reflection.Assembly]::LoadWithPartialName("System.Web")

    ## Get the web page into a single string with newlines between
    ## the lines.
    $encoded = [System.Web.HttpUtility]::UrlEncode($question)
    $url = "http://search.live.com/results.aspx?q=$encoded"
    $text = (New-Object System.Net.WebClient).DownloadString($url)
    ## Get the answer with annotations
    $startIndex = $text.IndexOf('<span class="answer_header">')
    $endIndex = $text.IndexOf('function YNC')
    ## If we found a result, then filter the result
    if(($startIndex -ge 0) -and ($endIndex -ge 0))
    {
        $partialText = $text.Substring($startIndex, $endIndex - $startIndex)

        ## Very fragile screen scraping here
        $pattern = '<script.+?<div (id="results"|class="answer_fact_body")>'
        $partialText = $partialText -replace $pattern,"`n"
        $partialText = $partialText -replace '<span class="attr.?.?.?">',"`n"
        $partialText = $partialText -replace '<BR ?/>',"`n"

        $partialText = clean-html $partialText
        $partialText = $partialText -replace "`n`n","`n"
"`n" + $partialText.Trim() } else {
"`nNo answer found." } }
## Clean HTML from a text chunk
function clean-html ($htmlInput)
{
    $tempString = [Regex]::Replace($htmlInput, "<[^>]*>", "")
    $tempString.Replace("&nbsp;&nbsp;", "")
}
. Main
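To try the script, pass your question as its argument, as shown in the header comment (the answer text is omitted here, since it depends on the live search page):

PS >Get-Answer "What is the population of China?"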