Context

When filtering, it's generally best to push the filter as far upstream as possible to get good performance. For example, to get all text files from a directory in PowerShell, Get-ChildItem 'c:\temp\' -Filter '*.txt' is preferable to Get-ChildItem 'c:\temp\' | Where-Object {$_.Name -like '*.txt'}.

However, in some situations the upstream component can't express the filter we need. For example, to find any image file we'd either have to make multiple calls to Get-ChildItem, passing a different value to Filter for each type (which traverses the directory multiple times and may return the same file more than once if it matches several filters), or perform the filtering downstream.

If I were searching for image files (for this specific example, let's say that's '*.png', '*.gif', '*.jpg', '*.jpeg'), one approach would be to send '*.*g*' as the filter to the provider, so we eliminate a lot of candidates early on, then filter for the specific extensions we're interested in downstream.

Question

Is there a known method for deriving a "like" pattern/mask that partially implements a regex, i.e. one that matches everything the regex matches, and possibly more?

e.g. so I could implement something like this:

Function Get-ImageFiles {
    Param(
        [Parameter(Mandatory)]
        [string]$LiteralPath
        ,
        [Parameter()]
        [string]$Pattern = '\.(?:png|gif|jpg|jpeg)$'
    )
    $simpleMask = ConvertTo-SimpleMask -RegexPattern $Pattern
    [System.IO.Directory]::EnumerateFiles($LiteralPath, $simpleMask) |
        Select-String -Pattern $Pattern -Raw
}

# for '\.(?:png|gif|jpg|jpeg)$'      simpleMask would be '*.*g*'
# for '\.(?:jpg|jpeg)$'              simpleMask would be '*.jp*g'
# for '\.(?:png|gif|jpg|jpeg|webp)$' simpleMask would be '*.*'

Note: In this question I've used PowerShell for my example code; but I'm interested in any solution to this "regex to simple filter" problem. This is more a question of curiosity than specific to the above example use case.

JohnLBevan

2 Answers

No, you can't convert an arbitrary regex pattern to a wildcard pattern - as wildcard patterns are essentially a subset of regex.
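The subset relationship only runs one way: any mask built from * and ? can be rewritten as a regex, which is why the opposite direction cannot work in general. A minimal sketch of the easy direction follows; the function name ConvertTo-RegexFromWildcard is a hypothetical helper for illustration, and it assumes the mask uses only * and ? as wildcards (no [...] sets and none of the file-system API quirks).

```powershell
# Hypothetical helper (not part of PowerShell): translate a simple wildcard
# mask that uses only * and ? into an equivalent anchored regex.
function ConvertTo-RegexFromWildcard {
    param(
        [Parameter(Mandatory)]
        [string]$Mask
    )
    # Escape every regex metacharacter first, then turn the two escaped
    # wildcards back into their regex equivalents.
    $escaped = [regex]::Escape($Mask)
    '^' + $escaped.Replace('\*', '.*').Replace('\?', '.') + '$'
}

ConvertTo-RegexFromWildcard '*.jp*g'                               # ^.*\.jp.*g$
'photo.jpeg' -match (ConvertTo-RegexFromWildcard '*.jp*g')         # True
```

A regex such as '\.(?:png|gif)$' has no such mechanical translation back, because a wildcard mask has no alternation.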

If performance is your chief concern, enumerate all files and filter based on the Extension property - that way you won't need any variable-length matching capabilities at all, and the comparison will be much faster than any regex or wildcard comparison:

Get-ChildItem -File |Where-Object Extension -in '.png','.gif','.jpg','.jpeg'

Mathias R. Jessen

To add to Mathias R. Jessen's helpful answer:

if we wanted to find any image file we'd either have to make multiple calls to Get-ChildItem, passing different values to Filter for each type

The -Include parameter allows you to pass multiple patterns and therefore requires only one Get-ChildItem call.

However:

  • Unlike -Filter, -Include does not filter at the source and requires the cmdlet to enumerate all items and itself match them against the specified patterns.

  • Unless you use -Recurse, you'll have to end your input path in * to make -Include work as intended - see this answer for details.

  • Note that the wildcard "dialects" supported by -Filter vs. -Include (and -Exclude and -Path) differ: -Filter uses the wildcard patterns of the underlying file-system APIs, which are both less powerful than PowerShell's wildcards and have many legacy quirks - see this answer for details.

Therefore, for instance:

# Note the trailing * in the (positionally implied) -Path argument,
# to make -Include work as intended.
# This isn't necessary if -Recurse is also used.
Get-ChildItem c:\temp\* -Include *.png, *.gif, *.jpg, *.jpeg

To improve performance you can actually combine a -Filter argument with -Include, which means that you'll get fast preliminary filtering with -Filter, but with false positives, which the -Include patterns then eliminate:

# Note the use of *both* -Filter and -Include
Get-ChildItem c:\temp\* -Filter *.*g* -Include *.png, *.gif, *.jpg, *.jpeg
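
The coarse -Filter mask in the command above was written by hand. When the target is a plain list of literal extensions (not an arbitrary regex, which remains out of reach), such a mask could be derived mechanically. The following is only a sketch under that assumption: Get-CommonExtensionMask is a hypothetical helper name, and it keeps the common prefix, the common suffix, and the longest substring shared by what lies between them.

```powershell
# Hypothetical helper: builds one wildcard mask that matches a superset of
# the names ending in any of the given literal extensions (no leading dot).
function Get-CommonExtensionMask {
    param(
        [Parameter(Mandatory)]
        [string[]]$Extension
    )
    $exts = @($Extension | ForEach-Object { $_.ToLowerInvariant() })

    # Longest prefix shared by every extension.
    $prefix = $exts[0]
    while ($prefix.Length -gt 0 -and
           @($exts | Where-Object { -not $_.StartsWith($prefix, [StringComparison]::Ordinal) }).Count -gt 0) {
        $prefix = $prefix.Substring(0, $prefix.Length - 1)
    }
    $rest = @($exts | ForEach-Object { $_.Substring($prefix.Length) })

    # Longest suffix shared by what remains after the prefix.
    $suffix = $rest[0]
    while ($suffix.Length -gt 0 -and
           @($rest | Where-Object { -not $_.EndsWith($suffix, [StringComparison]::Ordinal) }).Count -gt 0) {
        $suffix = $suffix.Substring(1)
    }
    $mid = @($rest | ForEach-Object { $_.Substring(0, $_.Length - $suffix.Length) })

    # Longest substring of the first middle part that occurs in all of them.
    $common = ''
    $m0 = $mid[0]
    :search for ($len = $m0.Length; $len -ge 1; $len--) {
        for ($i = 0; $i + $len -le $m0.Length; $i++) {
            $candidate = $m0.Substring($i, $len)
            if (@($mid | Where-Object { -not $_.Contains($candidate) }).Count -eq 0) {
                $common = $candidate
                break search
            }
        }
    }

    $mask = '*.' + $prefix + '*'
    if ($common.Length -gt 0) { $mask += $common + '*' }
    $mask + $suffix
}

Get-CommonExtensionMask -Extension 'png', 'gif', 'jpg', 'jpeg'   # *.*g*
Get-CommonExtensionMask -Extension 'jpg', 'jpeg'                 # *.jp*g
```

The result is deliberately loose (a single extension 'png' yields '*.png*'), so the precise -Include patterns are still needed as the second stage.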

Note that a pattern such as *.*g* doesn't limit matching of the *g* part to the extension, and also matches file names such as f.g.txt or f._g.txt.
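
The two-stage effect can be checked without touching the file system. The sketch below uses the -like operator as a stand-in for both stages; that is an assumption for illustration only, since -like speaks PowerShell's wildcard dialect while -Filter uses the file-system one, although both agree for these particular names. The sample file names are made up.

```powershell
# Made-up candidate names, including the two false positives noted above.
$names = 'a.png', 'b.jpeg', 'f.g.txt', 'f._g.txt', 'notes.txt'

# Stage 1: the coarse mask (stand-in for -Filter) lets false positives through.
$coarse = @($names | Where-Object { $_ -like '*.*g*' })

# Stage 2: the precise patterns (stand-in for -Include) remove them again.
$patterns = '*.png', '*.gif', '*.jpg', '*.jpeg'
$precise = @($coarse | Where-Object {
    $name = $_
    @($patterns | Where-Object { $name -like $_ }).Count -gt 0
})

$coarse -join ', '    # a.png, b.jpeg, f.g.txt, f._g.txt
$precise -join ', '   # a.png, b.jpeg
```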

mklement0