I have a large number of files on which I want to run a word analysis, counting how often each word appears within each file. As the final output I want a CSV file with the file names in the header row and two columns per file: the word and its count.
file1 word, file1 count, file2 word, file2 count, ....
hello, 4, world, 5, ...
password, 10, save, 2, ...
To achieve this I open each file and store its word counts in a hash table. Because each hash table has a different length (a different number of unique words), I try to put the results into one data table per file and collect those in a data set for export:
$file = Get-ChildItem -Recurse
$out = New-Object System.Data.DataSet "ResultsSet"
foreach ($f in $file) {
    $pres = $ppt.Presentations.Open($f.FullName, $true, $true, $false)
    $id = $f.Name.Substring(0, 5)
    $results = @{} # Hash table for this file
    for ($i = 4; $i -le $pres.Slides.Count; $i++) {
        $s = $pres.Slides($i)
        $shapes = $s.Shapes
        $textBox = $shapes | Where-Object { $_.TextFrame.TextRange.Length -gt 100 }
        if ($textBox -ne $null) {
            $textBox.TextFrame.TextRange.Words() |
                ForEach-Object { $_.Text.Trim() } |
                ForEach-Object {
                    if (-not $results.ContainsKey("$_")) { $results.Add($_, 1) }
                    else { $results["$_"] += 1 }
                }
        }
    }
    $pres.Close()
    $dt = New-Object System.Data.DataTable
    $dt.TableName = $id
    [String]$dt.Columns.Add("$id Word")
    [Int]$dt.Columns.Add("$id Count")
    foreach ($r in ($results.GetEnumerator() | Sort-Object Value)) {
        $dt.Rows.Add($r.Key, $r.Value)
    }
    $out.Tables.Add($dt)
}
$out | Export-Csv
There are two main issues:
- The number of unique words is different for each file (the hash tables have different lengths).
- Files are read one by one, so the results for each file need to be cached before being exported.
Somehow my script does not produce the output I want, only metadata. How can I get the correct output?
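To make the target layout concrete, here is a minimal sketch of the caching and padding that the two issues above seem to call for. It is not the script I am running: the two hash tables are hard-coded toy data standing in for real files, and the names `$cache`, `$sorted`, `$rows`, `file1`, and `file2` are made up for illustration. It assumes PowerShell 3.0 or later (for `[ordered]` and `[pscustomobject]`).

```powershell
# Toy per-file word counts standing in for the real hash tables (hypothetical data).
$cache = [ordered]@{
    file1 = @{ hello = 4; password = 10 }
    file2 = @{ world = 5; save = 2; the = 7 }
}

# Sort each file's entries once and remember the longest list,
# because the longest list decides how many rows are needed.
$sorted = [ordered]@{}
$rowCount = 0
foreach ($id in $cache.Keys) {
    $sorted[$id] = @($cache[$id].GetEnumerator() | Sort-Object Value -Descending)
    if ($sorted[$id].Count -gt $rowCount) { $rowCount = $sorted[$id].Count }
}

# Build one object per row; files with fewer words get empty cells.
$rows = for ($i = 0; $i -lt $rowCount; $i++) {
    $row = [ordered]@{}
    foreach ($id in $sorted.Keys) {
        $hasEntry = $i -lt $sorted[$id].Count
        $row["$id word"]  = if ($hasEntry) { $sorted[$id][$i].Key }   else { '' }
        $row["$id count"] = if ($hasEntry) { $sorted[$id][$i].Value } else { '' }
    }
    [pscustomobject]$row
}

# Show the CSV text; Export-Csv with a -Path would write the same rows to a file.
$rows | ConvertTo-Csv -NoTypeInformation
```

In this sketch every row is a plain object with one word/count property pair per file, which is the shape the CSV cmdlets turn into columns; the shorter file is simply padded with empty strings.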