When working with HTML, it is common to require advanced regular expressions that separate the content you care about from the content you don’t. A perfect example of this is extracting all the HTML links from a web page.
Links come in many forms, depending on how lenient you want to be. They may be well-formed according to the various HTML standards. They may use relative paths, or they may use absolute paths. They may place double quotes around the URL, or they may place single quotes around the URL. If you’re really unlucky, they may accidentally include quotes on only one side of the URL.
Example 9-2 demonstrates some approaches for dealing with this type of advanced parsing task. Given a web page that you’ve downloaded from the Internet, it extracts all links from the page and returns a list of the URLs in that page. It also fixes URLs that were originally written as relative URLs (for example, /file.zip) to include the server from which they originated.
Example 9-2. Get-PageUrls.ps1
##############################################################################
##
## Get-PageUrls.ps1
##
## Parse all of the URLs out of a given file.
##
## Example:
##    Get-PageUrls microsoft.html http://www.microsoft.com/
##
##############################################################################

param(
    ## The filename to parse
    [string] $filename = $(throw "Please specify a filename."),

    ## The URL from which you downloaded the page, including the
    ## trailing slash or file name so that the server can be determined.
    ## For example, http://www.microsoft.com/
    [string] $base = $(throw "Please specify a base URL."),

    ## The Regular Expression pattern with which to filter
    ## the returned URLs
    [string] $pattern = ".*"
)

## Load the System.Web DLL so that we can decode URLs
[void] [Reflection.Assembly]::LoadWithPartialName("System.Web")

## Defines the regular expression that will parse an URL
## out of an anchor tag.
$regex = "<\s*a\s*[^>]*?href\s*=\s*[`"']*([^`"'>]+)[^>]*?>"

## Parse the file for links
function Main
{
    ## Do some minimal source URL fixups, by switching backslashes to
    ## forward slashes
    $base = $base.Replace("\", "/")

    if($base.IndexOf("://") -lt 0)
    {
        throw "Please specify a base URL in the form of " +
            "http://server/path_to_file/file.html"
    }

    ## Determine the server from which the file originated. This will
    ## help us resolve links such as "/somefile.zip"
    $base = $base.Substring(0,$base.LastIndexOf("/") + 1)
    $baseSlash = $base.IndexOf("/", $base.IndexOf("://") + 3)
    $domain = $base.Substring(0, $baseSlash)

    ## Put all of the file content into a big string, and
    ## get the regular expression matches
    $content = [String]::Join(' ', (Get-Content $filename))
    $contentMatches = @(GetMatches $content $regex)

    foreach($contentMatch in $contentMatches)
    {
        if(-not ($contentMatch -match $pattern)) { continue }

        $contentMatch = $contentMatch.Replace("\", "/")

        ## Hrefs may look like:
        ## ./file
        ## file
        ## ../../../file
        ## /file
        ## url
        ## We'll keep all of the relative paths, as they will resolve.
        ## We only need to resolve the ones pointing to the root.
        if($contentMatch.IndexOf("://") -gt 0)
        {
            $url = $contentMatch
        }
        elseif($contentMatch[0] -eq "/")
        {
            $url = "$domain$contentMatch"
        }
        else
        {
            $url = "$base$contentMatch"
            $url = $url.Replace("/./", "/")
        }

        ## Return the URL, after first removing any HTML entities
        [System.Web.HttpUtility]::HtmlDecode($url)
    }
}

function GetMatches([string] $content, [string] $regex)
{
    $returnMatches = New-Object System.Collections.ArrayList

    ## Match the regular expression against the content, and
    ## add all trimmed matches to our return list
    $resultingMatches = [Regex]::Matches($content, $regex, "IgnoreCase")
    foreach($match in $resultingMatches)
    {
        $cleanedMatch = $match.Groups[1].Value.Trim()
        [void] $returnMatches.Add($cleanedMatch)
    }

    $returnMatches
}

. Main
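To see how the anchor-tag pattern copes with the different quoting styles described earlier, it can help to run it against a small inline sample. The following is a minimal sketch rather than part of the script: the sample HTML and the example.com addresses are made up for illustration, and the final step swaps in the .NET System.Uri class to resolve relative links instead of the string handling used in the script.

```powershell
## The same pattern the script uses to pull the URL out of an anchor tag
$regex = "<\s*a\s*[^>]*?href\s*=\s*[`"']*([^`"'>]+)[^>]*?>"

## Made-up sample covering double quotes, single quotes, no quotes,
## and a quote on only one side of the URL
$sample = @'
<a href="http://example.com/a.html">double quotes</a>
<A HREF='/b.zip'>single quotes</A>
<a class="x" href=c.html>no quotes</a>
<a href="d.html>one-sided quote</a>
'@

## Group 1 of each match holds the URL itself
$links = [Regex]::Matches($sample, $regex, "IgnoreCase") |
    ForEach-Object { $_.Groups[1].Value.Trim() }

## Alternative technique (not what the script does): let System.Uri
## resolve each link against a hypothetical page address
$base = New-Object System.Uri "http://www.example.com/downloads/index.html"
$resolved = foreach($link in $links)
{
    (New-Object System.Uri $base, $link).AbsoluteUri
}

$resolved
```

Here the absolute link is returned unchanged, /b.zip resolves against the server root, and the two bare file names resolve against the downloads directory of the hypothetical page.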
Program: Connect-WebService
Although screen scraping (parsing the HTML of a web page) is the most common way to obtain data from the Internet, web services are becoming increasingly common. Web services provide a significant advantage over HTML parsing, as they are much less likely to break when the web designer changes minor features in a design.
A more stable interface is not the only benefit of web services, however. When working with web services, the .NET Framework lets you generate proxies that let you interact with the web service as easily as you would work with a regular .NET object. To you, the web service user, these proxies act almost exactly the same as any other .NET object: to call a method on the web service, simply call a method on the proxy.
The primary differences you will notice when working with a web service proxy (as opposed to a regular .NET object) are the speed and Internet connectivity requirements. Depending on conditions, a method call on a web service proxy could easily take several seconds to complete. If your computer (or the remote computer) experiences network difficulties, the call might even return a network error message (such as a timeout) instead of the information you had hoped for.
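Because a proxy call can fail for purely transient network reasons, one common way to cope is to retry the call a few times before giving up. The following is only a sketch under stated assumptions: Invoke-WithRetry is a hypothetical helper (not a built-in command), the script block stands in for a real proxy method call, and the try / catch statements assume PowerShell 2.0 or later.

```powershell
## Hypothetical helper: run a script block, retrying when it throws
function Invoke-WithRetry([ScriptBlock] $action, [int] $maxAttempts = 3)
{
    for($attempt = 1; $attempt -le $maxAttempts; $attempt++)
    {
        try
        {
            return (& $action)
        }
        catch
        {
            ## Out of attempts: pass the original error on to the caller
            if($attempt -eq $maxAttempts) { throw }

            ## Otherwise, pause briefly before trying again
            Start-Sleep -Milliseconds 200
        }
    }
}

## Stand-in for a proxy method call that times out twice, then succeeds
$script:calls = 0
$result = Invoke-WithRetry {
    $script:calls++
    if($script:calls -lt 3) { throw "The operation has timed out" }
    "facts"
}

$result
```

In real use, the script block would contain the call on the proxy object, and the pause between attempts would likely be longer.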
Example 9-3 lets you connect to a remote web service if you know the location of its service description file (WSDL). It generates the web service proxy for you, allowing you to interact with it as you would any other .NET object.
Example 9-3. Connect-WebService.ps1
##############################################################################
## Connect-WebService.ps1
##
## Connect to a given web service, and create a type that allows you to
## interact with that web service.
##
## Example:
##
##     $wsdl = "http://terraserver.microsoft.com/TerraService2.asmx?WSDL"
##     $terraServer = Connect-WebService $wsdl
##     $place = New-Object Place
##     $place.City = "Redmond"
##     $place.State = "WA"
##     $place.Country = "USA"
##     $facts = $terraserver.GetPlaceFacts($place)
##     $facts.Center
##############################################################################
param(
    [string] $wsdlLocation = $(throw "Please specify a WSDL location"),
    [string] $namespace,
    [Switch] $requiresAuthentication)

## Create the web service cache, if it doesn't already exist
if(-not (Test-Path Variable:\Lee.Holmes.WebServiceCache))
{
    ${GLOBAL:Lee.Holmes.WebServiceCache} = @{}
}

## Check if there was an instance from a previous connection to
## this web service. If so, return that instead.
$oldInstance = ${GLOBAL:Lee.Holmes.WebServiceCache}[$wsdlLocation]
if($oldInstance)
{
    $oldInstance
    return
}

## Load the required Web Services DLL
[void] [Reflection.Assembly]::LoadWithPartialName("System.Web.Services")

## Download the WSDL for the service, and create a service description from
## it.
$wc = New-Object System.Net.WebClient

if($requiresAuthentication)
{
    $wc.UseDefaultCredentials = $true
}

$wsdlStream = $wc.OpenRead($wsdlLocation)

## Ensure that we were able to fetch the WSDL
if(-not (Test-Path Variable:\wsdlStream))
{
    return
}

$serviceDescription =
    [Web.Services.Description.ServiceDescription]::Read($wsdlStream)
$wsdlStream.Close()

## Ensure that we were able to read the WSDL into a service description
if(-not (Test-Path Variable:\serviceDescription))
{
    return
}

## Import the web service into a CodeDom
$serviceNamespace = New-Object System.CodeDom.CodeNamespace
if($namespace)
{
    $serviceNamespace.Name = $namespace
}

$codeCompileUnit = New-Object System.CodeDom.CodeCompileUnit
$serviceDescriptionImporter =
    New-Object Web.Services.Description.ServiceDescriptionImporter
$serviceDescriptionImporter.AddServiceDescription(
    $serviceDescription, $null, $null)
[void] $codeCompileUnit.Namespaces.Add($serviceNamespace)
[void] $serviceDescriptionImporter.Import(
    $serviceNamespace, $codeCompileUnit)

## Generate the code from that CodeDom into a string
$generatedCode = New-Object Text.StringBuilder
$stringWriter = New-Object IO.StringWriter $generatedCode
$provider = New-Object Microsoft.CSharp.CSharpCodeProvider
$provider.GenerateCodeFromCompileUnit($codeCompileUnit, $stringWriter, $null)

## Compile the source code.
$references = @("System.dll", "System.Web.Services.dll", "System.Xml.dll")
$compilerParameters = New-Object System.CodeDom.Compiler.CompilerParameters
$compilerParameters.ReferencedAssemblies.AddRange($references)
$compilerParameters.GenerateInMemory = $true

$compilerResults =
    $provider.CompileAssemblyFromSource($compilerParameters, $generatedCode)

## Write any errors if generated.
if($compilerResults.Errors.Count -gt 0)
{
    ## Collect each compiler error as "line: message". The loop variable
    ## avoids the name of PowerShell's automatic $error variable.
    $errorLines = ""
    foreach($compilerError in $compilerResults.Errors)
    {
        $errorLines += "`n`t" + $compilerError.Line + ":`t" +
            $compilerError.ErrorText
    }

    Write-Error $errorLines
    return
}
## There were no errors. Create the web service object and return it.
else
{
    ## Get the assembly that we just compiled
    $assembly = $compilerResults.CompiledAssembly

    ## Find the type that had the WebServiceBindingAttribute.
    ## There may be other "helper types" in this file, but they will
    ## not have this attribute
    $type = $assembly.GetTypes() |
        Where-Object { $_.GetCustomAttributes(
            [System.Web.Services.WebServiceBindingAttribute], $false) }

    if(-not $type)
    {
        Write-Error "Could not generate web service proxy."
        return
    }

    ## Create an instance of the type, store it in the cache,
    ## and return it to the user.
    $instance = $assembly.CreateInstance($type)

    ## Many services that support authentication also require it on the
    ## resulting objects
    if($requiresAuthentication)
    {
        if(@($instance.PsObject.Properties |
            where { $_.Name -eq "UseDefaultCredentials" }).Count -eq 1)
        {
            $instance.UseDefaultCredentials = $true
        }
    }

    ${GLOBAL:Lee.Holmes.WebServiceCache}[$wsdlLocation] = $instance

    $instance
}
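The caching step at the top of the script (store each proxy in a global hashtable keyed by its WSDL location, and hand back the stored copy on later calls) is easier to see in isolation. The following is a minimal sketch of that idea, not the script itself: New-SlowProxy, Get-CachedProxy, the ExampleProxyCache variable, and the server address are all hypothetical names chosen for illustration, and a simple string stands in for the compiled proxy.

```powershell
## Hypothetical stand-in for the slow step (downloading the WSDL and
## compiling a proxy). It only counts how often it runs.
$script:buildCount = 0
function New-SlowProxy([string] $location)
{
    $script:buildCount++
    "proxy for $location"
}

## Return the cached object for a location, building it only once
function Get-CachedProxy([string] $location)
{
    ## Create the cache, if it doesn't already exist
    if(-not (Test-Path Variable:\ExampleProxyCache))
    {
        $GLOBAL:ExampleProxyCache = @{}
    }

    ## Reuse the instance from a previous call, if there is one
    $cached = $GLOBAL:ExampleProxyCache[$location]
    if($cached) { return $cached }

    ## Otherwise, build it and remember it for next time
    $instance = New-SlowProxy $location
    $GLOBAL:ExampleProxyCache[$location] = $instance
    $instance
}

$first = Get-CachedProxy "http://server/service.asmx?WSDL"
$second = Get-CachedProxy "http://server/service.asmx?WSDL"

## The slow step ran only once, even though we asked twice
$script:buildCount
```

Keeping the cache in the global scope is what lets it survive between separate invocations of a script in the same session.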
Export Command Output As a Web Page
Problem
You want to export the results of a command as a web page so that you can post it to a web server.
Solution
Use PowerShell’s ConvertTo-Html cmdlet to convert command output into a web page. For example, to create a quick HTML summary of PowerShell’s commands:
PS >$filename = "c:\temp\help.html"
PS >
PS >$commands = Get-Command | Where { $_.CommandType -ne "Alias" }
PS >$summary = $commands | Get-Help | Select Name,Synopsis
PS >$summary | ConvertTo-Html | Set-Content $filename
Discussion
When you use the ConvertTo-Html cmdlet to export command output to a file, PowerShell generates an HTML table that represents the command output. In the table, it creates a row for each object that you provide. For each row, PowerShell creates columns to represent the values of your object’s properties.
The ConvertTo-Html cmdlet lets you customize this table to some degree through parameters that allow you to add custom content to the head and body of the resulting page.
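As an illustration of that customization, the following minimal sketch passes the cmdlet's Head and Body parameters along with a Property list that selects the table columns. The sample objects, the report title, the inline style, and the items.html file name are assumptions made up for the example.

```powershell
## Two made-up objects; each becomes one row of the table
$items = @(
    New-Object PSObject -Property @{ Name = "alpha"; Size = 10 }
    New-Object PSObject -Property @{ Name = "beta"; Size = 20 }
)

## Custom content for the <head> element. When -Head is used, the
## page title is supplied here as well.
$head = "<title>Item Report</title><style>td { padding: 4px }</style>"

## -Property picks the columns; -Body adds content ahead of the table
$html = $items |
    ConvertTo-Html -Property Name,Size -Head $head -Body "<h1>Item Report</h1>"

## Save the page to a temporary location
$filename = Join-Path ([IO.Path]::GetTempPath()) "items.html"
$html | Set-Content $filename
```

Opening the saved file in a browser should show the heading followed by a two-column table with one row per object.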
For more information about the ConvertTo-Html cmdlet, type Get-Help ConvertTo-Html.