INTRODUCTION to Parsing HTML :
If you are familiar with the PowerShell Invoke-WebRequest cmdlet, you already know that it returns parsed HTML for the requested web URL (in Windows PowerShell, through the ParsedHtml property of the response). The DOM structure of the web page is what gives you access to its HTML elements, and that is what parsing HTML means here.
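As a rough sketch of that idea, the function below fetches a page and reads a couple of values from its DOM. The function name Get-PageTitleAndLinks is hypothetical and used only for illustration. It assumes Windows PowerShell 5.1, because the ParsedHtml property is not available in PowerShell 6 and later.

```powershell
# Hypothetical helper, for illustration only.
# Assumes Windows PowerShell 5.1; ParsedHtml is not available in PowerShell 6+.
function Get-PageTitleAndLinks {
    param(
        [Parameter(Mandatory)]
        [string]$Uri
    )

    # Without -UseBasicParsing, the response carries a parsed DOM in ParsedHtml
    $response = Invoke-WebRequest -Uri $Uri
    $document = $response.ParsedHtml

    # Return the page title and the visible text of every anchor element
    [pscustomobject]@{
        Title = $document.title
        Links = @($document.getElementsByTagName('a') | ForEach-Object { $_.innerText })
    }
}
```

Calling it as `Get-PageTitleAndLinks -Uri 'https://example.com'` (a placeholder URL) should return one object with a Title and a Links property.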
PROBLEM :
What if the HTML files are stored locally on your machine, or the HTML content is only available as a string? Is there any mechanism to parse a local file or string?
SOLUTION :
Well the answer is – yes we can! 🙂
Microsoft provides an HTML document class in the .NET Framework class library. It has a Write() method that writes HTML into a document, which can then be accessed through DOM Level 2 (Document Object Model Level 2).
APPROACH 1 : From a String
Instantiate an object of the HTML document class and pass it the HTML content as a string. The HTML elements are then accessible through the document object.
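A minimal sketch of this approach follows. It assumes Windows with the MSHTML "HTMLFile" COM object available. The sample markup and variable names are made up for illustration. The try/catch is a commonly used workaround rather than part of the original tip: IHTMLDocument2_write tends to be exposed only on some machines (typically those with Office installed), so the code falls back to write() with a Unicode byte array.

```powershell
# Sample markup; the content is illustrative only.
$source = @'
<html>
  <body>
    <h1>Hello</h1>
    <p>First paragraph</p>
    <p>Second paragraph</p>
  </body>
</html>
'@

# Create an empty HTML document through COM (Windows only)
$html = New-Object -ComObject 'HTMLFile'

try {
    # Exposed on some systems, typically where Office is installed
    $html.IHTMLDocument2_write($source)
}
catch {
    # Fallback (assumption: needed where the method above is not exposed)
    $html.write([System.Text.Encoding]::Unicode.GetBytes($source))
}

# Collect the text of every <p> element
$paragraphs = @($html.all.tags('p') | ForEach-Object { $_.innerText })
$paragraphs
```

If the tags() call returns nothing on your machine, checking which of the two write paths ran is a reasonable first troubleshooting step.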
APPROACH 2 : From a File
Similarly, we can parse an HTML document from a local HTML file by reading the file content and writing it to the document object.
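The file-based variant might look like the sketch below. The file path and its contents are hypothetical, and the script creates the file first only so that the example is self-contained. As before, this assumes Windows and the "HTMLFile" COM object, and the fallback branch is an assumption for machines where IHTMLDocument2_write is not exposed.

```powershell
# Hypothetical sample file, created here only to keep the example self-contained
$path = Join-Path $env:TEMP 'sample.html'
Set-Content -Path $path -Value '<html><body><ul><li>One</li><li>Two</li></ul></body></html>'

# -Raw reads the whole file as one string instead of an array of lines
$content = Get-Content -Path $path -Raw

$html = New-Object -ComObject 'HTMLFile'
try {
    $html.IHTMLDocument2_write($content)
}
catch {
    # Fallback (assumption: needed where the method above is not exposed)
    $html.write([System.Text.Encoding]::Unicode.GetBytes($content))
}

# Collect the text of every <li> element
$items = @($html.all.tags('li') | ForEach-Object { $_.innerText })
$items
```

Using -Raw matters here: without it, Get-Content returns an array of lines, and the document would not receive the markup as a single string.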
NOTE :
Even the parsed HTML returned by Invoke-WebRequest is of the HTML document class type.
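One way to check this yourself is to inspect the type of a locally created document. The reported name can differ between machines (for example, a generic COM wrapper type may be shown instead of the HTML document class), so treat this as a quick inspection sketch that assumes Windows and the "HTMLFile" COM object, not as a guaranteed result.

```powershell
# Create a document object and look at the type PowerShell reports for it
$html = New-Object -ComObject 'HTMLFile'
$typeName = $html.GetType().FullName
$typeName

# Get-Member shows the same information together with the available members
$html | Get-Member | Select-Object -First 5
```

Running the same GetType() call against the ParsedHtml property of an Invoke-WebRequest response in Windows PowerShell lets you compare the two.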
That was all on today’s #Powershell Tip, Thanks for reading! 🙂
Prateek Singh
7 Comments
Thanks for this mate, it was really helpful!
Jeez…. what a difficult way to show some code…. It is unreadable this way – goes too fast.
Do you have any idea how annoying the animated GIF’s are?
Had to put the animated gif into IRFanView and stop the animation (key-> G) to read the very USEFUL information.
This is amazing, helped me a LOT!
@prateek – I was really hopeful this was going to help me resolve a problem of post processing some HTML documents. We have quite complex HTML local files where the main code generation tool erroneously places some HTML character codes into the document (which then breaks the intention when the text is copy and pasted into another application). I therefore wanted to parse the HTML to replace some text within some tags, specifically I was looking to grab the `<code>...</code>` tags. However, `$html.all.tags("code") | % innertext` returns nothing. In fact, no matter what tag I specify, I get nothing.
So, I simply tried to use the code in a Windows PS 5.1 session (Windows 10), but that too returned nothing.