INTRODUCTION – Text Summary with PowerShell :
This is a PowerShell script that summarizes long text documents within a word limit you choose. It scores each sentence on parameters such as important words and common content, then builds the summary from the highest-scoring sentences in the order in which they occur in the content.
HOW IT WORKS :
- GET THE CONTENT :
Get the contents of a file or of the clipboard and store them in a temporary variable.
- SPLIT INTO SENTENCES :
Split the complete document into sentences on the newline string and remove empty or blank lines. Each remaining line is treated as one sentence.
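Here is a minimal sketch of that splitting step. The sample text and variable names are made up for illustration, and the pattern accepts both Windows and Unix line endings, which is an assumption on my part rather than what the module does.

```powershell
# Sample text: contains one empty line and one whitespace-only line.
$Text = "First line of the document.`r`n`r`nSecond line.`n   `nThird line."

# Split on CRLF or LF, then drop lines that are empty after trimming.
# @( ) keeps the result an array even when only one line survives.
$Sentences = @($Text -split '\r?\n' | Where-Object { $_.Trim() })

$Sentences.Count    # 3
$Sentences[0]       # First line of the document.
```

Wrapping the result in `@( )` matters: with a single surviving line PowerShell would otherwise hand back a plain string, and indexing into it returns characters instead of sentences.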
- RANK EACH SENTENCE :
Once you have all the sentences, rank each one with a score that depends mainly on the following two criteria:
- IMPORTANT WORDS : To identify the important words in the content, calculate the frequency distribution of every word, drop words of three characters or fewer (for example "The", "Are", "For", "As"), group the rest, sort them by count and select the top 10. Each important word gets a weight proportional to its frequency in the content (the script uses Count * 0.01).
- COMMON WORDS IN EACH SENTENCE : To get an idea of how many words each sentence shares with all the other sentences, find the intersection of each sentence with every other sentence in the content, i.e. score each sentence on the words it has in common with the others. The more words a sentence shares with the rest, the better it defines/summarizes the complete document.
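The two criteria can be sketched roughly as below. This is a simplified illustration, not the module's code: `Get-Word` and `Get-CommonWordCount` are hypothetical helpers, the sample sentences are invented, and the common-content part only counts distinct shared words, whereas the script uses `Compare-Object` and also divides by the sentence lengths.

```powershell
$Sentences = @(
    'PowerShell makes text processing simple.'
    'Text processing in PowerShell relies on objects.'
    'Cats sleep most of the day.'
)

# Hypothetical helper: lower-cased words of a sentence with punctuation stripped.
function Get-Word([string]$Sentence) {
    $Sentence -split '\s+' |
        ForEach-Object { ($_ -replace '[^a-zA-Z0-9]', '').ToLower() } |
        Where-Object { $_ }
}

# Important words: frequency distribution of words longer than 3 characters,
# top 10 by count, each weighted by Count * 0.01.
$ImportantWords = $Sentences | ForEach-Object { Get-Word $_ } |
    Where-Object { $_.Length -gt 3 } |
    Group-Object | Sort-Object Count -Descending |
    Select-Object -First 10 -Property Name, @{ n = 'Weight'; e = { $_.Count * 0.01 } }

# Common content (simplified): number of distinct words two sentences share.
function Get-CommonWordCount([string]$A, [string]$B) {
    $WordsB = @(Get-Word $B | Select-Object -Unique)
    @(Get-Word $A | Select-Object -Unique | Where-Object { $_ -in $WordsB }).Count
}

$ImportantWords | Select-Object -First 3
Get-CommonWordCount -A $Sentences[0] -B $Sentences[1]   # 3 (powershell, text, processing)
Get-CommonWordCount -A $Sentences[0] -B $Sentences[2]   # 0
```

The first two sentences share three words, so they score each other up, while the sentence about cats has nothing in common with the rest and ends up at the bottom.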
- SELECT THE BEST SENTENCES :
Once each sentence has been scored on the two parameters above, add the individual scores (CommonContentScore + ImportanceScore = SentenceScore) and sort the sentences from highest to lowest score. Count the words in each sentence and select only the highest-scored ones that fit within the word limit.
NOTE: The best sentences must be put back in the order of their actual occurrence in the content so that they make sense. Otherwise they will look jumbled and won't read like a summary.
- OUTPUT SUMMARY :
Display the best sentences on the screen as a paragraph, which is the summary of the complete document.
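To make the selection and re-ordering concrete, here is a small sketch that works on hand-made scores. The objects, scores and word counts below are hypothetical values chosen for the example. Only the property names follow the ones used in the script.

```powershell
# Hypothetical ranked sentences; scores and word counts are invented for the example.
$Ranked = @(
    [PSCustomObject]@{ LineNumber = 0; SentenceScore = 1.9; WordCount = 9;  Sentence = 'Opening claim.' }
    [PSCustomObject]@{ LineNumber = 1; SentenceScore = 0.4; WordCount = 12; Sentence = 'Minor detail.' }
    [PSCustomObject]@{ LineNumber = 2; SentenceScore = 1.2; WordCount = 8;  Sentence = 'Side note.' }
    [PSCustomObject]@{ LineNumber = 3; SentenceScore = 2.5; WordCount = 10; Sentence = 'Key finding.' }
)
$WordLimit  = 20
$TotalWords = 0

# Take sentences from the highest score down until the next one would exceed the limit.
$Best = foreach ($Item in ($Ranked | Sort-Object SentenceScore -Descending)) {
    if ($TotalWords + $Item.WordCount -gt $WordLimit) { break }
    $TotalWords += $Item.WordCount
    $Item
}

# Put the chosen sentences back in document order before joining them.
$Summary = ($Best | Sort-Object LineNumber).Sentence -join ' '
$Summary    # Opening claim. Key finding.
```

"Key finding." has the highest score and is picked first, but it still appears last in the output because it comes later in the document.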
INSTALLATION:
The module is available on the PowerShell Gallery, and you can install it directly if you have PowerShell v5.
SCRIPT:
You can also download the module from TechNet or from my GitHub repository.
<#
.SYNOPSIS
Returns a summary of a text document.
.DESCRIPTION
Returns a summary of the text document passed to it, within your chosen word limit (default: 100 words).
.PARAMETER File
Text file with the content to summarize.
.PARAMETER WordLimit
Maximum number of words allowed in the summary.
.PARAMETER FromClipBoard
Summarize the content copied to the clipboard instead of a file.
.EXAMPLE
PS Root\> Get-Summary -File .\Document.txt
“Since my letter [of October 28], the FBI investigation team has been working round-the-clock to process and review a large volume of emails from a device obtained in connection with an unrelated criminal investigation,” Mr. Comey said. It was reported that there were 6,50,000 emails on that laptop.
Provide a path to a text file and the cmdlet generates a summary for you. By default the summary is limited to 100 words or fewer.
.EXAMPLE
PS Root\> Get-Summary -File D:\Document.txt -WordLimit 150
“Since my letter [of October 28], the FBI investigation team has been working round-the-clock to process and review a large volume of emails from a device obtained in connection with an unrelated criminal investigation,” Mr. Comey said. It was reported that there were 6,50,000 emails on that laptop. Democratic leader Nancy Pelosi said that after yet another exhaustive review of emails to and from Ms. Clinton, the FBI has again given her a clean bill of health. “The FBI’s findings from its criminal investigation of Hillary Clinton’s secret email server were a damning and unprecedented indictment of her judgment. The FBI found evidence Clinton broke the law, that she placed highly classified national security information at risk and repeatedly lied to the American people about her reckless conduct,” he said.
You can also pass a value to the '-WordLimit' parameter to increase or decrease the length of the summary.
.EXAMPLE
PS Root\> Get-Summary -File D:\Document.txt -Verbose
“Since my letter [of October 28], the FBI investigation team has been working round-the-clock to process and review a large volume of emails from a device obtained in connection with an unrelated criminal investigation,” Mr. Comey said. It was reported that there were 6,50,000 emails on that laptop.
VERBOSE: Content has been summarized from 875 to 48 Words
Add the '-Verbose' switch to view the summarization ratio, i.e. the original number of words against the number of words in the summary.
.EXAMPLE
PS Root\> Get-Summary -FromClipBoard -WordLimit 50
Indian Prime Minister Narendra Modi has won the online reader’s poll for TIME Person of the Year, beating out other world leaders, artists and politicians as the most .
Use the '-FromClipBoard' switch to summarize the content copied to the clipboard.
.INPUTS
None. You cannot pipe objects to Get-Summary.
.LINK
Get-Content
.LINK
http://RidiCurious.com
.NOTES
Author : Prateek Singh
Twitter : @SinghPrateik
Blog : http://RidiCurious.com
#>
Function Get-Summary
{
    [CmdletBinding()]
    [Alias('Summary')]
    [OutputType([String])]
    Param(
        [Parameter(Position = 0)] [String] $File,
        [Parameter(Position = 1)] [Int] $WordLimit = 100,
        [Switch] $FromClipBoard
    )
    Begin
    {
        If ($File)
        {
            $Content = Get-Content $File
        }
        ElseIf ($FromClipBoard)
        {
            # The clipboard class lives in the PresentationCore assembly
            Add-Type -AssemblyName PresentationCore
            $Content = [Windows.Clipboard]::GetText()
        }
        Else
        {
            Write-Warning "Please provide a file path or copy content to Clipboard"
        }
    }
    Process
    {
        # Nothing to summarize if no content was read in the Begin block
        If (-not $Content) { return }

        $TotalWords = 0
        $Summary = @()
        # Extract the best sentences with the highest ranks within the word limit
        $BestSentences = Foreach ($Item in (Get-SentenceRank $Content | Sort-Object SentenceScore -Descending))
        {
            # Stop as soon as the running word count exceeds the limit
            $TotalWords += $Item.WordCount
            If ($TotalWords -gt $WordLimit)
            {
                break
            }
            Else
            {
                $Item
            }
        }
        If ($BestSentences)
        {
            # Construct a paragraph with the sentences in their original order
            Foreach ($Best in (($BestSentences | Sort-Object LineNumber).Sentence))
            {
                # Append a period when the sentence does not already end with one
                If (-not $Best.Trim().EndsWith("."))
                {
                    $Summary += "$Best."
                }
                Else
                {
                    $Summary += $Best
                }
            }
            [String]$Summary
            Write-Verbose "Content has been summarized from $($Content.Split(" ").Count) to $(([String]$Summary).Split(" ").Count) Words"
        }
        Else
        {
            Write-Warning "Word Limit is too small to summarize the document."
        }
    }
    End
    {
    }
}
Function Get-Intersection($Sentence1, $Sentence2)
{
    # Words that occur in both sentences
    $CommonWords = @(Compare-Object -ReferenceObject $Sentence1 -DifferenceObject $Sentence2 -IncludeEqual | Where-Object {$_.SideIndicator -eq '=='} | Select-Object -ExpandProperty InputObject)
    # Evaluated left to right: (common words / total words in both sentences) / 2
    $CommonWords.Count / ($Sentence1.Count + $Sentence2.Count) / 2
}
Function Get-SentenceRank($Content)
{
    # Treat every non-blank line as a sentence. @() keeps the result an array even for a single
    # line; otherwise $Sentences[$i] indexes into a string and returns a [Char] without a Split method.
    $Sentences = @($Content -split '\r?\n' | Where-Object {$_.Trim()})
    $NoOfSentences = $Sentences.Count
    If ($NoOfSentences -eq 0) { return }
    $CommonContentWeight = New-Object double[] $NoOfSentences
    # Frequency distribution of all words, with punctuation stripped
    $FrequencyDistribution = $Sentences.Split(" ") | Where-Object {-not [String]::IsNullOrEmpty($_)} | ForEach-Object {[Regex]::Replace($_, '[^a-zA-Z0-9]', '')} | Group-Object | Sort-Object Count -Descending
    # Important words are the 10 most frequent ones longer than 3 characters, to avoid in, on, of, to, by etc.
    $ImportantWords = $FrequencyDistribution | Where-Object {$_.Name.Length -gt 3} | Select-Object @{n = 'ImportanceWeight'; e = {$_.Count * 0.01}}, @{n = 'ImportantWord'; e = {$_.Name}} -First 10
    Foreach ($i in (0..($NoOfSentences - 1)))
    {
        $ImportanceWeight = 0
        $WordsInReferenceSentence = $Sentences[$i].Split(" ") | ForEach-Object {[Regex]::Replace($_, '[^a-zA-Z0-9]', '')}
        # Score each sentence on the words it has in common with every other sentence.
        # The more words a sentence shares with the others, the better it defines the complete document.
        Foreach ($j in (0..($NoOfSentences - 1)))
        {
            $WordsInDifferenceSentence = $Sentences[$j].Split(" ") | ForEach-Object {[Regex]::Replace($_, '[^a-zA-Z0-9]', '')}
            $CommonContentWeight[$i] = $CommonContentWeight[$i] + (Get-Intersection $WordsInReferenceSentence $WordsInDifferenceSentence)
        }
        Foreach ($Item in ($WordsInReferenceSentence | Select-Object -Unique))
        {
            # Keep adding ImportanceWeight for every important word found in the sentence
            If ($Item -in $ImportantWords.ImportantWord)
            {
                $ImportanceWeight += ($ImportantWords | Where-Object {$_.ImportantWord -eq $Item}).ImportanceWeight
            }
        }
        # Scores stay numeric (rounded to 3 decimals) so that Sort-Object orders them by value, not as text
        [PSCustomObject]@{
            LineNumber         = $i
            SentenceScore      = [Math]::Round($CommonContentWeight[$i] + $ImportanceWeight, 3)
            CommonContentScore = [Math]::Round($CommonContentWeight[$i], 3)
            ImportanceScore    = $ImportanceWeight
            WordCount          = $Sentences[$i].Split(" ").Count
            Sentence           = $Sentences[$i]
        }
    }
}
HOW TO RUN IT :
Once you’ve downloaded the module, import it into your PowerShell host session.
Provide a path to a text file and the cmdlet generates a summary for you. By default the summary is limited to 100 words or fewer.
You can also pass a value to the ‘-WordLimit’ parameter to increase or decrease the length of the summary.
Or add the ‘-Verbose’ switch to view the summarization ratio, i.e. the original number of words against the number of words in the summary.
You can also use the ‘-FromClipBoard’ switch to summarize the content copied to the clipboard.
If you find this script useful, you may also like my previous blog post on highlighting keywords in the PowerShell console, which works very well with this module.
Prateek Singh
Hello Prateek,
I like this module a lot but sometimes I get the following error:
Method invocation failed because [System.Char] does not contain a method named ‘Split’.
on these lines:
$WordsInReferenceSentence = $Sentences[$i].Split(" ") | ForEach-Object {[Regex]::Replace($_, '[^a-zA-Z0-9]', '')}
$WordsInDifferenceSentence = $Sentences[$j].Split(" ") | ForEach-Object {[Regex]::Replace($_, '[^a-zA-Z0-9]', '')}
I’m sure it has to do with the File parameter but I also get the error with -FromClipBoard switch.
I am creating the File using Set-Content. Any ideas?