I’ve started to mess around with PowerShell at home and at work, and I found a nice little test case. A forum I visit had a thread full of Wikipedia links I wanted, but there were an awful lot of them.
So, after a few Google searches and some experimenting, I had a script that downloads every Wikipedia link from a set of URLs listed in a text file.
cd C:\Users\Jeff\Documents\Development\Powershell\DownloadLinks
$pages = Get-Content ".\pages.txt"
$file = ".\links.txt"

foreach ($page in $pages)
{
    # Drive a real Internet Explorer instance via COM so the page is fully rendered
    $ie = New-Object -ComObject "InternetExplorer.Application"
    $ie.Navigate($page)
    While ($ie.Busy) { Start-Sleep -Milliseconds 400 }

    # Pull every anchor out of the rendered DOM and keep the Wikipedia ones
    $doc = $ie.Document
    $doc.getElementsByTagName('a') |
        Where-Object { $_.href -ne $null } |
        Where-Object { $_.href.Contains("wikipedia") } |
        Select-Object -ExpandProperty href |
        Out-File -FilePath $file -Append

    # Close the browser instance so each page doesn't leak an IE process
    $ie.Quit()
}
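The same link-filtering idea translates outside of PowerShell too. As a rough sketch, here is the equivalent "keep only anchors whose href mentions wikipedia" step in Python using the standard-library HTML parser (an assumption on my part: this parses raw HTML directly instead of driving a real browser the way the IE COM object does, so it would only work for pages that don't need scripting or a login to render):

```python
from html.parser import HTMLParser

class WikipediaLinkCollector(HTMLParser):
    """Collects href values of <a> tags that mention 'wikipedia'."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        # Same filter as the script: non-null href containing "wikipedia"
        if href and "wikipedia" in href:
            self.links.append(href)

def extract_wikipedia_links(html):
    parser = WikipediaLinkCollector()
    parser.feed(html)
    return parser.links

# Tiny made-up snippet standing in for a downloaded forum page
sample = ('<a href="http://en.wikipedia.org/wiki/Dice">Dice</a>'
          '<a href="http://example.com/">something else</a>')
print(extract_wikipedia_links(sample))  # only the Wikipedia href survives
```

Feeding it each downloaded page and appending the result to a file would mirror the Out-File -Append step above.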
And here is the list of pages to process:
http://forum.rpg.net/showthread.php?t=279379
http://forum.rpg.net/showthread.php?t=279379&page=2
http://forum.rpg.net/showthread.php?t=279379&page=3
http://forum.rpg.net/showthread.php?t=279379&page=4
http://forum.rpg.net/showthread.php?t=279379&page=5
http://forum.rpg.net/showthread.php?t=279379&page=6
http://forum.rpg.net/showthread.php?t=279379&page=7
http://forum.rpg.net/showthread.php?t=279379&page=8
http://forum.rpg.net/showthread.php?t=279379&page=9
http://forum.rpg.net/showthread.php?t=279379&page=10
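Since the list is just the same thread ID with an incrementing page number, the pages.txt file could be generated rather than typed out by hand. A quick sketch (the thread ID and page count are taken from the list above):

```python
# Same thread ID as the list above; pages 2 through 10 get a &page= suffix
base = "http://forum.rpg.net/showthread.php?t=279379"
pages = [base] + [f"{base}&page={n}" for n in range(2, 11)]

# Write one URL per line, matching the pages.txt format the script reads
with open("pages.txt", "w") as f:
    f.write("\n".join(pages))
```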
Pages I used to build this:
http://powershell.com/cs/blogs/tips/archive/2010/05/03/use-null-to-identify-empty-data.aspx
http://powershell.com/cs/blogs/tobias/archive/2010/03/17/downloading-images-from-webpages.aspx
http://www.computerperformance.co.uk/powershell/powershell_file_outfile.htm
http://technet.microsoft.com/en-us/library/ff730958.aspx
http://www.orcsweb.com/blog/jeremy/powershell-pearl-filter-by-contained-text/