I am trying to return the highest 4-digit number that matches a string pattern across a set of Word documents.

String Pattern: 3 Letters dash 4 Digits

Each Word document contains a document identifier code such as those below.

Sample Files:

Car Parts.docx > CPW - 2345

CarHandles.docx > CPW - 8723

CarList.docx > CPA - 9083

I am trying to adapt some sample code that I found. I am not a VBA or PowerShell programmer, so my approach may be wrong.

I am happy to look at alternatives, as long as they run on Windows.

I used the following references to get started:

http://chris-nullpayload.rhcloud.com/2012/07/find-and-replace-string-in-all-docx-files-recursively/

PowerShell: return the number of instances find in a file for a search pattern

Powershell: return filename with highest number

$list = gci "C:\Users\WP\Desktop\SearchFiles" -Include *.docx -Force -recurse
foreach ($foo in $list) {

$objWord = New-Object -ComObject word.application
$objWord.Visible = $False

$objDoc = $objWord.Documents.Open("$foo")
$objSelection = $objWord.Selection 

$Pat1 = [regex]'[A-Z]{3}-[0-9]{4}'   # Find the regex match 3 letters  followed by 4 numbers eg     HGW - 1024

$findtext= "$Pat1"

 $highestNumber = 

 # Find the highest occurrence of this pattern found in the documents searched - output to text file or on screen

Sort-Object |                   # This may also be wrong -I added it for when I find the pattern
Select-Object -Last 1 -ExpandProperty Name


<#   The below may not be needed  - ?

$ReplaceText = ""

$ReplaceAll = 2
$FindContinue = 1
$MatchFuzzy = $False
$MatchCase = $False
$MatchPhrase = $false
$MatchWholeWord = $True
$MatchWildcards = $True
$MatchSoundsLike = $False
$MatchAllWordForms = $False
$Forward = $True
$Wrap = $FindContinue
$Format = $False

$objSelection.Find.execute(
    $FindText,
    $MatchCase,
    $MatchWholeWord,
    $MatchWildcards,
    $MatchSoundsLike,
    $MatchAllWordForms,
    $Forward,
    $Wrap,
    $Format,
    $ReplaceText,
    $ReplaceAll
  }

}
#>

I would appreciate any advice on how to proceed.

wp44
  • The best way to proceed is to determine exactly what is not working, find out why, and then fix that. If you have any *specific* problems getting this up and running, do not hesitate to ask. – Andrew Savinykh Mar 15 '16 at 22:10
  • Hi Andrew, I could not get the pattern $Pat1 = [regex]'[A-Z]{3}-[0-9]{4}' to match to begin with. I have added notes in the code where I am stuck. – wp44 Mar 15 '16 at 22:14
  • By the way, which PowerShell version are you on? – Andrew Savinykh Mar 15 '16 at 22:35
  • Hi Andrew, I believe version 5. I am on Windows 10, and I have PowerShell ISE as well. Thank you. – wp44 Mar 15 '16 at 22:53

2 Answers

Try this:

# This library is needed to extract zip archives; a .docx file is a zip archive.
# .NET 4.5 or later is required.
Add-Type -AssemblyName System.IO.Compression.FileSystem

# This function gets plain text from a Word document.
# Adapted from http://stackoverflow.com/a/19503654/284111
# It is not ideal, but good enough.
function Extract-Text([string]$fileName) {

  # Generate a random temporary file name (full path) for text extraction from the .docx
  $tempFileName = Join-Path ([System.IO.Path]::GetTempPath()) ([Guid]::NewGuid().Guid)

  # Extract the document XML into a variable ($text)
  $zip = [System.IO.Compression.ZipFile]::OpenRead($fileName)
  try {
    $entry = $zip.GetEntry("word/document.xml")
    [System.IO.Compression.ZipFileExtensions]::ExtractToFile($entry, $tempFileName)
  }
  finally {
    # Close the archive so the .docx file is not left locked
    $zip.Dispose()
  }
  $text = [System.IO.File]::ReadAllText($tempFileName)
  Remove-Item $tempFileName

  # Remove the XML tags and leave the text behind
  $text = $text -replace '</w:r></w:p></w:tc><w:tc>', " "
  $text = $text -replace '</w:r></w:p>', "`r`n"
  $text = $text -replace "<[^>]*>",""

  return $text
}

$fileList = Get-ChildItem "C:\Users\WP\Desktop\SearchFiles" -Include *.docx -Force -recurse
# Adapted from http://stackoverflow.com/a/36023783/284111
$fileList | 
  Foreach-Object {[regex]::matches((Extract-Text $_.FullName), '(?<=[A-Za-z]{3}\s*(?:-|–)\s*)\d{4}')} | 
  Select-Object -ExpandProperty captures | 
  Sort-Object value -Descending | 
  Select-Object -First 1 -ExpandProperty value 

The main idea is to avoid the Word COM API altogether and instead extract the text directly from the document's XML.
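
As a variation on the same idea, the XML entry can also be read from the archive straight into memory, which avoids the temporary file. This is only a minimal sketch, not the code above: the function name Get-DocxText is hypothetical, the tag stripping is deliberately crude, and it assumes PowerShell with .NET 4.5 or later.

```powershell
# ZipFile lives in System.IO.Compression.FileSystem, ZipArchive in System.IO.Compression
Add-Type -AssemblyName System.IO.Compression
Add-Type -AssemblyName System.IO.Compression.FileSystem

# Hypothetical helper: returns the text of word/document.xml with the XML tags stripped
function Get-DocxText([string]$Path) {
  $zip = [System.IO.Compression.ZipFile]::OpenRead($Path)
  try {
    $entry = $zip.GetEntry('word/document.xml')
    if ($null -eq $entry) { return '' }   # not a Word document body

    # Read the entry directly from the archive stream
    $reader = New-Object System.IO.StreamReader($entry.Open())
    try { $xml = $reader.ReadToEnd() }
    finally { $reader.Dispose() }
  }
  finally {
    # Always release the file handle, even if reading fails
    $zip.Dispose()
  }

  # Paragraph ends become line breaks; every remaining tag is dropped
  $xml = $xml -replace '</w:p>', "`r`n"
  return ($xml -replace '<[^>]*>', '')
}
```

It would be called with a full path, for example `Get-DocxText $_.FullName` inside the Foreach-Object block.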

Andrew Savinykh
  • Andrew, thank you for your generous help. It found the highest number and output it to the screen. I am happy: it does what I need, which is to find the highest number in the documents, and that is good enough for me for now. It did give some errors, but as I have no experience I will learn more about this and slowly build up my skills. Thanks again for your help. – wp44 Mar 16 '16 at 04:52

The way to get the highest number is to first isolate the digits with a regex, then sort the matches in descending order and select the first item. Something like this:

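# $objDoc.Content.Text is the document text as a string; the Selection object itself is not a string
[regex]::matches($objDoc.Content.Text, '(?<=[A-Z]{3}\s*-\s*)\d{4}')  `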
  | Select -ExpandProperty captures `
  | sort value -Descending `
  | Select -First 1 -ExpandProperty value `
  | Add-Content outfile.txt

I think the problem with your regex is that your example data contains spaces around the dash in the code, which you haven't allowed for in your pattern.
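
To illustrate the point about spaces, here is a small sketch run against the three sample codes from the question. The variable names are made up for the example, and the sample strings stand in for real document text.

```powershell
# Sample codes copied from the question; in practice this would be the document text
$samples = 'CPW - 2345', 'CPW - 8723', 'CPA - 9083'

# Original pattern: no whitespace allowed around the dash, so nothing matches
$strict = '[A-Z]{3}-[0-9]{4}'
$strictHits = @($samples | Where-Object { $_ -match $strict })

# Relaxed pattern: optional whitespace on both sides of the dash;
# the lookbehind keeps only the four digits in the match value
$relaxed = '(?<=[A-Z]{3}\s*-\s*)\d{4}'
$highest = $samples |
  ForEach-Object { [regex]::Matches($_, $relaxed) } |
  ForEach-Object { [int]$_.Value } |
  Sort-Object -Descending |
  Select-Object -First 1

"Strict matches: $($strictHits.Count)"   # Strict matches: 0
"Highest number: $highest"               # Highest number: 9083
```

Casting each match to [int] before sorting is a choice made for this sketch; it keeps the ordering numeric rather than alphabetical, which only matters if the digit count ever varies.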

Dave Sexton
  • Hi Dave, thank you for your help. I have done this: $list = gci "C:\Users\WP\Desktop\Search\" -Include *.docx -Force -recurse foreach ($foo in $list) { $objWord = New-Object -ComObject word.application $objWord.Visible = $False $objDoc = $objWord.Documents.Open("$foo") $objSelection = $objWord.Selection [regex]::matches($objSelection, '(?<=[A-Z]{3}\s*\s*)\d{4}') ` | Select -ExpandProperty captures ` | sort value -Descending ` | Select -First 1 -ExpandProperty value ` | Add-Content outfile.txt } I must have done it wrong: + FullyQualifiedErrorId : System.Runtime.InteropServices.COMException – wp44 Mar 15 '16 at 23:32
  • Thank you, Dave. I will add your code to my PowerShell tools :) – wp44 Mar 16 '16 at 04:52