[ PowerShell ] Data Harvesting: all dictionary words for each letter of the alphabet from the Web
I am working on a personal project wherein I need common/popular dictionary words for lookups, and performance is a concern.
INITIAL THOUGHTS :
Figure out a way to use the built-in dictionary in Microsoft Word. I couldn’t find much with a Google search (maybe I didn’t try hard enough :P). Not gonna work!
What if I use the REST API of an online dictionary? The API calls would be slow and would end up causing performance issues. Nah, not good enough!
How about web scraping/harvesting all the dictionary words from a popular online dictionary and storing them locally? That would make them easily accessible without a performance setback. This may work 🙂
Alright! Let’s find an online dictionary. After some Google searching I landed on this website, which has 3000 core vocabulary words covering all 26 letters of the alphabet.
Moreover, they have implemented pagination: if you click Next on a page, it opens a new page with more words for that letter, and the URL gains a page number at the end.
Click Next again and the page number keeps incrementing, and so on.
Perfect! I believe I can implement a PowerShell function that traverses the website’s pages for each letter using the pagination logic and then extracts the words. That will give me around 3000 core vocabulary words, which is good enough 🙂 So let’s start!
HOW IT WORKS :
The URL syntax for this website looks like this:
http://learnersdictionary.com/3000-words/alpha/a/2
http://learnersdictionary.com/3000-words/alpha/b/5
http://learnersdictionary.com/3000-words/alpha/c/3
We can make a loop that increments the page number in every iteration to generate the next URL for a letter. But the pages are finite in number, so we still have to figure out when to break out of this loop.
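To make the URL pattern concrete, here is a minimal sketch of a helper that builds the URL for a given letter and page. The function name Get-DictionaryPageUrl is my own invention for illustration, and I am assuming that page 1 follows the same /letter/number pattern as the later pages.

```powershell
# Hypothetical helper (not from the original script): builds the URL of one page for one letter.
function Get-DictionaryPageUrl {
    param(
        [Parameter(Mandatory)]
        [ValidatePattern('^[a-zA-Z]$')]
        [string] $Letter,

        # ASSUMPTION: page numbering starts at 1 and page 1 uses the same URL shape.
        [int] $Page = 1
    )

    $base = 'http://learnersdictionary.com/3000-words/alpha'
    '{0}/{1}/{2}' -f $base, $Letter.ToLower(), $Page
}

# Generate the URLs of the first three pages for the letter "a".
$urls = 1..3 | ForEach-Object { Get-DictionaryPageUrl -Letter 'a' -Page $_ }
$urls
```

Keeping the URL building in one small function means the loop only has to bump a counter.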
OK! Go to the last page for any letter and you will observe that the pagination link to the “Next page” is not present there, and this holds for all letters. Eureka! This means I can inspect the HTML DOM (Document Object Model) structure of every web page, and if there is no link to the “Next page”, I am on the last page. Simple, isn’t it? 🙂
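The last-page check can be sketched roughly like this. Note that this is a simplification: instead of walking the DOM it runs a regular expression over the raw HTML string, and the markup of the “Next” link (an anchor whose text starts with “Next”) is an assumption of mine, so adjust the pattern to whatever the real page contains. The name Test-HasNextPage and both sample snippets are made up for the demo.

```powershell
# Hypothetical helper: returns $true when the page HTML still contains a "Next" link.
function Test-HasNextPage {
    param(
        [Parameter(Mandatory)]
        [AllowEmptyString()]
        [string] $Html
    )

    # ASSUMPTION: the pagination link is an <a> element whose visible text starts with "Next".
    $pattern = '(?is)<a\b[^>]*>\s*Next\b.*?</a>'
    [regex]::IsMatch($Html, $pattern)
}

# Made-up HTML fragments standing in for a middle page and a last page.
$middlePage = '<ul><li><a href="/3000-words/alpha/a/1">Prev</a></li><li><a href="/3000-words/alpha/a/3">Next &gt;</a></li></ul>'
$lastPage   = '<ul><li><a href="/3000-words/alpha/a/8">Prev</a></li></ul>'

Test-HasNextPage -Html $middlePage   # True
Test-HasNextPage -Html $lastPage     # False
```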
Having figured out how to stop the iterations, let’s now gather the data by making a web request for each page
in an iterative manner, getting parsedHTML as an intermediate result and filtering only the HTML tags in which the data is sitting, as in the screenshot below.
Do some data wrangling (converting the data from its raw form to a more consumable format) and emit the result as output. Keep iterating until you are on the last page for a letter.
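As an illustration of the filtering and wrangling step, here is a small sketch that pulls the words out of a page and tidies them up. I am assuming each word is the text of a link pointing to a /definition/ page; that is a guess at the markup, not something verified against the site, and ConvertTo-WordList is a hypothetical name. It also works on the raw HTML string with a regex rather than on parsedHTML.

```powershell
# Hypothetical helper: extracts words from page HTML and normalises them.
function ConvertTo-WordList {
    param(
        [Parameter(Mandatory)]
        [string] $Html
    )

    # ASSUMPTION: every word is the text of an <a> whose href starts with /definition/.
    $pattern = '(?is)<a\b[^>]*href="/definition/[^"]*"[^>]*>(.*?)</a>'

    [regex]::Matches($Html, $pattern) |
        ForEach-Object {
            $text = $_.Groups[1].Value -replace '<[^>]+>', ''            # drop any nested tags
            [System.Net.WebUtility]::HtmlDecode($text).Trim().ToLower()  # decode entities, trim, lowercase
        } |
        Where-Object { $_ } |      # skip empty strings
        Sort-Object -Unique        # remove duplicates and sort
}

# Made-up fragment with a duplicate, stray whitespace and an unrelated link.
$sample = '<li><a href="/definition/abandon"> Abandon </a></li>' +
          '<li><a href="/definition/ability">ability</a></li>' +
          '<li><a href="/definition/abandon">abandon</a></li>' +
          '<li><a href="/about">About us</a></li>'

$result = @(ConvertTo-WordList -Html $sample)
$result   # abandon, ability
```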
Wrap this logic in a PowerShell function, “Get-DictionaryWord”, and run the script below to harvest dictionary words for all 26 letters, generating around 3000 words.
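Putting the pieces together, a function along these lines could do the job. Treat it as a sketch under stated assumptions rather than the exact script: the -Fetcher and -MaxPages parameters are my own additions (so the loop can be tried without hitting the website and cannot run forever), the two regex patterns encode guessed markup, and the raw HTML is matched with regexes instead of going through parsedHTML.

```powershell
function Get-DictionaryWord {
    param(
        [Parameter(Mandatory)]
        [ValidatePattern('^[a-zA-Z]$')]
        [string] $Letter,

        # Hypothetical injection point: takes a URL, returns the page HTML as a string.
        # The default performs a real web request.
        [scriptblock] $Fetcher = { param($Url) (Invoke-WebRequest -Uri $Url -UseBasicParsing).Content },

        # Safety cap (my addition) in case the "Next" detection never fails.
        [int] $MaxPages = 100
    )

    $base  = 'http://learnersdictionary.com/3000-words/alpha'
    $words = New-Object 'System.Collections.Generic.List[string]'

    for ($page = 1; $page -le $MaxPages; $page++) {
        $url  = '{0}/{1}/{2}' -f $base, $Letter.ToLower(), $page
        $html = [string](& $Fetcher $url)

        # ASSUMPTION: words are the text of links to /definition/ pages.
        foreach ($m in [regex]::Matches($html, '(?is)<a\b[^>]*href="/definition/[^"]*"[^>]*>(.*?)</a>')) {
            $word = [System.Net.WebUtility]::HtmlDecode($m.Groups[1].Value).Trim().ToLower()
            if ($word) { $words.Add($word) }
        }

        # ASSUMPTION: a page without a "Next" link is the last page for this letter.
        if (-not [regex]::IsMatch($html, '(?is)<a\b[^>]*>\s*Next\b.*?</a>')) { break }
    }

    $words | Sort-Object -Unique
}

# Offline demo with two made-up pages for the letter "z".
$fakePages = @{
    'http://learnersdictionary.com/3000-words/alpha/z/1' = '<a href="/definition/zone">zone</a><a href="/3000-words/alpha/z/2">Next</a>'
    'http://learnersdictionary.com/3000-words/alpha/z/2' = '<a href="/definition/zero">zero</a>'
}
$zWords = @(Get-DictionaryWord -Letter 'z' -Fetcher { param($Url) $fakePages[$Url] })
$zWords   # zero, zone
```

Leaving out -Fetcher makes the function request the real pages, and looping over [char[]](97..122) would cover all 26 letters.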
I’m saving the dictionary words in JSON files, one per starting letter, so they can be accessed easily and quickly.
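Saving and reloading one letter could look something like the sketch below. The function name Save-DictionaryWord, the a.json file naming and the temp-folder location are all assumptions for the demo, not necessarily how the hosted files are laid out.

```powershell
# Hypothetical helper: writes the words for one letter to <letter>.json and returns the file path.
function Save-DictionaryWord {
    param(
        [Parameter(Mandatory)] [string]   $Letter,
        [Parameter(Mandatory)] [string[]] $Words,
        [Parameter(Mandatory)] [string]   $OutputDirectory
    )

    if (-not (Test-Path -LiteralPath $OutputDirectory)) {
        New-Item -ItemType Directory -Path $OutputDirectory | Out-Null
    }

    # ASSUMPTION: one file per starting letter, named a.json, b.json, ...
    $path = Join-Path $OutputDirectory ('{0}.json' -f $Letter.ToLower())

    # -InputObject serialises the array as a whole instead of item by item.
    ConvertTo-Json -InputObject $Words | Set-Content -LiteralPath $path -Encoding UTF8
    $path
}

# Demo: write two sample words to a temp folder, then load them back for a lookup.
$dir    = Join-Path ([System.IO.Path]::GetTempPath()) 'dictionary-demo'
$file   = Save-DictionaryWord -Letter 'a' -Words 'abandon', 'ability' -OutputDirectory $dir
$loaded = Get-Content -LiteralPath $file -Raw | ConvertFrom-Json

$loaded -contains 'ability'   # True
```

Since each file only holds the words for one letter, a lookup only has to load the file matching the first letter of the word.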
I hosted these dictionary JSON files on my GitHub repo here, and I will be using them in some more awesome PowerShell functions in my coming blog posts, so stay tuned! 😉