PowerShell fiddling around with web scraping: Twitter user profiles, images and much more
I’m a big fan of REST APIs; they are efficient, reliable and fun.
Recently I have been playing with the Twitter REST APIs and started wondering: is it possible to get the required information from Twitter without using the API, that is, without setting up the authentication model (OAuth tokens) or connecting to the right endpoint?
The point of this post is to see what we can achieve with shell scripting, NOT to suggest that APIs are bad in any way.
INITIAL HYPOTHESIS :
All information on a Twitter web page sits under specific HTML tags and can be extracted by parsing the HTML content returned in response to a request for that page. Let’s take https://twitter.com/followers as an example.
But I would need a web control mechanism (a property or method) to keep scrolling the page so that the full list of user profiles is populated; those profiles only appear as you scroll down to the bottom of the page. We can’t use Invoke-WebRequest for this, as it only fetches the initial response and has no way to scroll the page it is requesting.
Once we figure this out and the web page is fully populated, we can use web scraping techniques to get information from any Twitter page (Followers, Following, Timeline etc.).
STEPS BREAKDOWN :
These are the steps I followed to harvest the user profiles of all my followers on Twitter:
- Create an InternetExplorer.Application COM (Component Object Model) object and navigate to the URL; wait until the URL has been fully loaded in IE.
- Programmatically scroll to the bottom of the page so that all user profiles are populated.
- Once all profiles are populated in the Internet Explorer window, use the Internet Explorer COM object to access the web page Document (parsed HTML).
- Filter out the required data sitting in specific HTML tags.
- Convert the raw filtered data to PowerShell objects and generate presentable output on screen.
- Prerequisite: log in to Twitter in Internet Explorer beforehand and tick the ‘Remember me’ checkbox, so that IE opens your Twitter profile by default when the script runs.
- The infinite scrolling has to be stopped once all data is populated. A window of at most 30 seconds worked for me, but this may change depending on the length of the page under your profile and the speed of your internet connection.
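The steps above can be sketched roughly as follows. This is a minimal sketch, not the actual script used in this post: the function names Open-ScrolledPage and Select-ElementByClass, their parameters and the 'ProfileCard' class name used in the note below are all assumptions, and Twitter’s real markup will differ and changes over time. It also assumes a Windows machine where the InternetExplorer.Application COM object is still available and where parentWindow is reachable through COM, which is not the case on every setup.

```powershell
# Sketch only: needs Windows with the InternetExplorer.Application COM object.
# Function names and parameters are hypothetical, not taken from the original script.

function Open-ScrolledPage {
    param(
        [string]$Url = 'https://twitter.com/followers',
        [int]$ScrollSeconds = 30    # scrolling window; tune for your page and connection
    )

    $ie = New-Object -ComObject 'InternetExplorer.Application'
    $ie.Visible = $true
    $ie.Navigate($Url)

    # Wait until IE reports the page as fully loaded (ReadyState 4 = complete).
    while ($ie.Busy -or $ie.ReadyState -ne 4) { Start-Sleep -Milliseconds 200 }

    # Jump to the bottom repeatedly so the page keeps loading more profiles.
    $deadline = (Get-Date).AddSeconds($ScrollSeconds)
    while ((Get-Date) -lt $deadline) {
        $height = $ie.Document.body.scrollHeight
        $ie.Document.parentWindow.scrollTo(0, $height)
        Start-Sleep -Milliseconds 500
    }

    $ie    # the caller reads $ie.Document and calls $ie.Quit() when done
}

function Select-ElementByClass {
    param(
        [Parameter(Mandatory)] $Elements,           # objects exposing a className property
        [Parameter(Mandatory)] [string]$ClassName
    )
    # className may hold several space-separated classes, so compare token by token.
    foreach ($element in $Elements) {
        if (("$($element.className)" -split '\s+') -contains $ClassName) { $element }
    }
}
```

With these two helpers, one would call Open-ScrolledPage, pass the result of $ie.Document.getElementsByTagName('div') together with a class name such as the hypothetical 'ProfileCard' to Select-ElementByClass, read innerText or child tags from each match, and finally call $ie.Quit(). Keeping the filter separate from the browser automation means it can be tried out on plain objects without opening IE at all.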
HOW TO RUN :
Run the function as I did in the animation below, and it will scrape user profile information from your Twitter web page.
OK, let’s check how many user profiles I was able to harvest.
Perfect! That looks good 🙂 and it is exactly the number of followers I have (only 286! What a shame 😀).
Now let’s filter out the user profiles of Microsoft MVP (Most Valuable Professional) awardees from my followers. I guess most of them have the keyword “MVP” in their user bio, so a simple “where” filter does the work for us, as in the following screenshot.
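For readers following along without the screenshots, here is a small sketch of the counting and filtering. The sample profiles and the property names (Name, Handle, Bio) are made up for illustration and are not necessarily the four properties the script harvests.

```powershell
# Illustrative data only: these profiles and property names are invented.
$profiles = @(
    [pscustomobject]@{ Name = 'Alice'; Handle = '@alice'; Bio = 'Microsoft MVP, PowerShell enthusiast' }
    [pscustomobject]@{ Name = 'Bob';   Handle = '@bob';   Bio = 'Coffee and cloud' }
    [pscustomobject]@{ Name = 'Carol'; Handle = '@carol'; Bio = 'Speaker | mvp | blogger' }
)

# How many profiles were harvested in total.
$profiles.Count

# -match is case-insensitive; \b restricts the match to MVP as a whole word.
$mvps = @($profiles | Where-Object { $_.Bio -match '\bMVP\b' })

# 2 with the sample data above (Alice and Carol).
$mvps.Count
$mvps | Format-Table Name, Handle, Bio -AutoSize
```

A looser filter such as `$_.Bio -like '*MVP*'` would also work, at the cost of matching any bio that merely contains those three letters inside a longer word.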
Though I’m harvesting only four properties from the web page, you can tweak the script as desired to get more information from the page’s HTML content.
Since I now know which MVPs follow me on Twitter, how about downloading their profile pictures from Twitter to my local drive, using the user profile data we have harvested? To achieve this, follow the steps in the animation below.
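In case the animation is not available, the download step could look something like this. It is only a sketch under assumptions: the function names Get-ImageFileName and Save-ProfileImage, the Handle and ImageUrl property names, and the default destination folder are all hypothetical.

```powershell
# Sketch only: function names, property names and the default folder are hypothetical.

function Get-ImageFileName {
    param(
        [Parameter(Mandatory)] [string]$Handle,
        [Parameter(Mandatory)] [string]$ImageUrl
    )
    # Reuse the extension from the URL path; fall back to .jpg when there is none.
    $extension = [System.IO.Path]::GetExtension(([uri]$ImageUrl).AbsolutePath)
    if (-not $extension) { $extension = '.jpg' }
    # Drop characters that are awkward in file names, such as the leading @.
    ($Handle -replace '[^\w-]', '') + $extension
}

function Save-ProfileImage {
    param(
        [Parameter(Mandatory)] $Profiles,    # objects with Handle and ImageUrl properties
        [string]$Destination = (Join-Path $HOME 'TwitterProfilePics')
    )
    # Create the target folder if it does not exist yet.
    New-Item -ItemType Directory -Path $Destination -Force | Out-Null

    foreach ($user in $Profiles) {
        $name = Get-ImageFileName -Handle $user.Handle -ImageUrl $user.ImageUrl
        $file = Join-Path $Destination $name
        # -OutFile writes the response body straight to disk.
        Invoke-WebRequest -Uri $user.ImageUrl -OutFile $file
    }
}
```

Assuming the filtered MVP profiles sit in a variable such as $mvps, calling `Save-ProfileImage -Profiles $mvps` would save one image per follower, named after the handle. The loop variable is deliberately called $user, because $profile is an automatic variable in PowerShell.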
Hope you’ll find it fun and fiddle around more with PowerShell! 😉 Thanks for stopping by.