5

Is there a way to determine whether a specified file contains a specified byte array (at any position) in powershell?

Something like:

fgrep --binary-files=binary "$data" "$filepath"

Of course, I can write a naive implementation:

function posOfArrayWithinArray {
    param ([byte[]] $arrayA, [byte[]]$arrayB)
    if ($arrayB.Length -ge $arrayA.Length) {
        foreach ($pos in 0..($arrayB.Length - $arrayA.Length)) {
            if ([System.Linq.Enumerable]::SequenceEqual(
                $arrayA,
                [System.Linq.Enumerable]::Skip($arrayB, $pos).Take($arrayA.Length)
            )) {return $pos}
        }
    }
    -1
}

function posOfArrayWithinFile {
    param ([byte[]] $array, [string]$filepath)
    posOfArrayWithinArray $array (Get-Content $filepath -Raw -AsByteStream)
}

# They return the position or -1, but a simple $false/$true would also be enough for me.

— but it's extremely slow.

Sasha
  • `[byte]$(get-content -path "C:\Thing") | findstr "word"`? – Nico Nekoru Jun 16 '20 at 03:26
  • @NekoMusume, do you mean `[byte[]]$(get-content -path "C:\Thing" -AsByteStream) | findstr "word"`? AFAIK, it doesn't guarantee to work with non-text data. And besides that it's even slower. – Sasha Jun 16 '20 at 03:46
  • 2
    What kind of performance you are looking for? How long byte patterns are you looking for? How big files are you processing? – vonPryz Jun 16 '20 at 08:40
  • @vonPryz, searching 0.5 MiB fragment within 100 MiB sequence takes more than 5 minutes. It's extremely long for modern PCs (I agree to wait seconds or even tens of seconds but not minutes). – Sasha Jun 16 '20 at 20:50

4 Answers

3

Sorry for the additional answer. It is not usual to do so, but the universal question intrigues me, and the approach and information in my initial "using -Like" answer are completely different. By the way, if you are waiting for a positive response to "I believe that it must exist in .NET" before accepting an answer, that is probably not going to happen: the same quest shows up in Stack Overflow searches in combination with C#, .NET or LINQ.
Anyway, since nobody has been able to find the single assumed .NET command for this so far, it is quite understandable that several semi-.NET solutions are being proposed instead, but I believe that these cause some undesired overhead for a universal function.
Assume that your ByteArray (the byte array being searched) and SearchArray (the byte array to search for) are completely random. There is only a 1/256 chance that a given byte in the ByteArray matches the first byte of the SearchArray; if it does not match, you don't have to look any further at that position. If it does match, the chance that the second byte matches as well is 1/256^2 in total, etc. This means that the inner loop will only run about 1.004 times as often as the outer loop. In other words, the performance of everything outside the inner loop (but in the outer loop) is almost as important as what is in the inner loop!
Note that this also implies that the chance that a 500 KB random sequence exists in a 100 MB random sequence is virtually zero. (So, how random are your given binary sequences actually? If they are far from random, I think you need to add some more details to your question.) A worst-case scenario for my assumption would be a ByteArray consisting of identical bytes (e.g. 0, 0, 0, ..., 0, 0, 0) and a SearchArray of the same bytes ending with a different byte (e.g. 0, 0, 0, ..., 0, 0, 1).
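The 1.004 figure can be checked with a few lines of arithmetic. This is only a sketch under the same assumption as above (independent, uniformly random bytes); the variable names are illustrative.

```powershell
# Expected number of inner-loop comparisons per outer-loop position,
# ASSUMING independent, uniformly random bytes (the assumption made above).
# Comparison number k+1 only happens if the first k bytes matched,
# which has probability (1/256)^k.
$p = 1 / 256
$expected = 0.0
foreach ($k in 0..9) {             # ten terms are plenty; later terms are negligible
    $expected += [Math]::Pow($p, $k)
}
$expected                           # converges to 256/255, roughly 1.0039
```

For data that is far from random (such as the all-zero worst case), this estimate does not apply and the inner loop can run up to the full SearchArray length at every position.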

Based on this, it shows again (as I have also shown in some other answers) that native PowerShell commands aren't that bad and can even outperform .NET/LINQ calls in some cases. In my testing, the Find-Bytes function below is somewhere between 20% faster and twice as fast as the function in your question:

Find-Bytes

Returns the index of where the -Search byte sequence is found in the -Bytes byte sequence. If the search sequence is not found a $Null ([System.Management.Automation.Internal.AutomationNull]::Value) is returned.

Parameters

-Bytes
The byte array to be searched

-Search
The byte array to search for

-Start
Defines where to start searching in the Bytes sequence (default: 0)

-All
By default, only the first index found will be returned. Use the -All switch to return the remaining indexes of any other search sequences found.

Function Find-Bytes([byte[]]$Bytes, [byte[]]$Search, [int]$Start, [Switch]$All) {
    For ($Index = $Start; $Index -le $Bytes.Length - $Search.Length ; $Index++) {
        For ($i = 0; $i -lt $Search.Length -and $Bytes[$Index + $i] -eq $Search[$i]; $i++) {}
        If ($i -ge $Search.Length) { 
            $Index
            If (!$All) { Return }
        } 
    }
}

Usage example:

$a = [byte[]]("the quick brown fox jumps over the lazy dog".ToCharArray())
$b = [byte[]]("the".ToCharArray())

Find-Bytes -all $a $b
0
31
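Since the reasoning above suggests that the per-position work in the outer loop dominates, one possible refinement is to let [Array]::IndexOf jump straight to the next occurrence of the first search byte. The following is a sketch only: Find-BytesFast is a hypothetical name, it is not part of the function above, and it has not been benchmarked here.

```powershell
# Hypothetical variant of Find-Bytes (an assumption, not the original function):
# [Array]::IndexOf locates the next occurrence of the first search byte, so the
# script-level loop only runs once per candidate position.
function Find-BytesFast([byte[]]$Bytes, [byte[]]$Search, [int]$Start, [switch]$All) {
    if ($Search.Length -eq 0) { return }
    $last  = $Bytes.Length - $Search.Length      # last position where a match can start
    $Index = $Start
    while ($Index -le $last) {
        $Index = [Array]::IndexOf($Bytes, $Search[0], $Index)
        if ($Index -lt 0 -or $Index -gt $last) { return }
        # Compare the remaining bytes of the candidate
        for ($i = 1; $i -lt $Search.Length -and $Bytes[$Index + $i] -eq $Search[$i]; $i++) {}
        if ($i -ge $Search.Length) {
            $Index
            if (-not $All) { return }
        }
        $Index++
    }
}

$a = [byte[]]("the quick brown fox jumps over the lazy dog".ToCharArray())
$b = [byte[]]("the".ToCharArray())
$hits = @(Find-BytesFast -All $a $b)
$hits
```

Whether this actually helps depends on the data: if the first search byte is very common in the searched array, the gain is likely to be small.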

Benchmark
Note that you should open a new PowerShell session to benchmark this properly, as LINQ uses a large cache that probably doesn't apply to your use case.

$a = [byte[]](&{ foreach ($i in (0..500Kb)) { Get-Random -Maximum 256 } })
$b = [byte[]](&{ foreach ($i in (0..500))   { Get-Random -Maximum 256 } })

Measure-Command {
    $y = Find-Bytes $a $b
}

Measure-Command {
    $x = posOfArrayWithinArray $b $a
}
iRon
  • Nobody said that the data is uniformly distributed. – Sasha Jun 22 '20 at 11:54
  • Nobody said that the data is *not* uniformly distributed (or that there are patterns in the given data). No offense, but in your own answer you state "*... to match sequences **universally***", which also suggests that you are looking to match unknown/random data; otherwise you will need to be more specific about what type of sequence you actually want to match. If the data is specific and contains certain patterns, you might be able to take advantage of that. Anyway, I would still like to know whether it makes any difference for you. – iRon Jun 22 '20 at 12:17
  • 1
    No offense from my side either, but _unknown data_ doesn't mean _uniformly distributed_ data (I said "universally", and uniform distribution is just a specific particular case). – Sasha Jun 22 '20 at 12:44
  • Your solution is really **much faster** :) 14 seconds (vs. 540 seconds for the solution in the question). I'm currently quite puzzled why. – Sasha Jun 22 '20 at 12:46
  • I am happy to see that, although I do not have an explanation either. What I see is that the `Find-Bytes` function takes about 50% to 80% of the time of the `posOfArrayWithinArray` function (where there is **no match found**). Be aware that I swapped the order of the **ByteArray** and **SearchArray** arguments in my function. By the way, I had another thought: it is not that important whether the **ByteArray** is uniformly distributed; it is more important how uniformly the **SearchArray** is distributed. – iRon Jun 22 '20 at 14:03
  • 1
    "50% to 80%" — probably, because you use much shorter arrays (I search for 500 KiB within 100 MiB, your benchmark searches for 500 B within 500 KiB). – Sasha Jun 22 '20 at 14:21
1

The below code may prove to be faster, but you will have to test that out on your binary files:

function Get-BinaryText {
    # converts the bytes of a file to a string that has a
    # 1-to-1 mapping back to the file's original bytes. 
    # Useful for performing binary regular expressions.
    Param (
        [Parameter(Mandatory = $true, ValueFromPipeline = $true, ValueFromPipelineByPropertyName = $true)]
        [ValidateScript( { Test-Path $_ -PathType Leaf } )]
        [Alias('FullName','FilePath')]
        [string]$Path
    )

    $Stream = New-Object System.IO.FileStream -ArgumentList $Path, 'Open', 'Read'

    # Note: Codepage 28591 returns a 1-to-1 char to byte mapping
    $Encoding     = [Text.Encoding]::GetEncoding(28591)
    $StreamReader = New-Object System.IO.StreamReader -ArgumentList $Stream, $Encoding
    $BinaryText   = $StreamReader.ReadToEnd()

    $Stream.Dispose()
    $StreamReader.Dispose()

    return $BinaryText
}

# enter the byte array to search for here
# for demo, I'll use 'SearchMe' in bytes
[byte[]]$searchArray = 83,101,97,114,99,104,77,101

# create a regex from the $searchArray bytes
# 'SearchMe' --> '\x53\x65\x61\x72\x63\x68\x4D\x65'
$searchString = ($searchArray | ForEach-Object { '\x{0:X2}' -f $_ }) -join ''
$regex = [regex]$searchString

# read the file as binary string
$binString = Get-BinaryText -Path 'D:\test.bin'

# use regex to return the 0-based starting position of the search string
# return -1 if not found
$found = $regex.Match($binString)
if ($found.Success) { $found.Index } else { -1}
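
The claim that code page 28591 gives a 1-to-1 char-to-byte mapping can be checked directly. The following is a small sketch (Test-RoundTrip is a hypothetical helper name, not part of the code above) that converts all 256 byte values to a string and back, once with code page 28591 and once with UTF-8:

```powershell
# Hypothetical helper: does decoding and re-encoding return the original bytes?
function Test-RoundTrip([System.Text.Encoding]$Encoding, [byte[]]$Data) {
    $back = $Encoding.GetBytes($Encoding.GetString($Data))
    ($Data -join ',') -eq ($back -join ',')
}

$allBytes = [byte[]](0..255)

# Code page 28591 (ISO-8859-1) maps every byte to exactly one char and back
$latin1Ok = Test-RoundTrip ([System.Text.Encoding]::GetEncoding(28591)) $allBytes

# UTF-8 replaces invalid byte sequences with U+FFFD, so the bytes are altered
$utf8Ok = Test-RoundTrip ([System.Text.Encoding]::UTF8) $allBytes

"28591 round-trips: $latin1Ok; UTF-8 round-trips: $utf8Ok"
```

With a lossy encoding, different byte sequences can collapse into the same replacement character, which is why a search on such a string may report false positives.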
Theo
  • `Get-BinaryText -Path $path` can be replaced with `Get-Content -Path $path -Raw -Encoding 28591` (or simply `Get-Content $path -Raw`, we don't even need to specify encoding). – Sasha Jun 16 '20 at 19:02
  • Thanks for your answer. To tell the truth, I didn't try it: I emotionally dislike the idea of recoding the input data to a longer form (it looks like a redundant step to me), and I've found another workaround (see below). Still, it should be faster than the one I wrote in the question (so your idea does in fact work). But it's a pity that we can't easily find a sequence-in-sequence search method in .NET (it must exist, especially since it exists for character strings, i.e. the algorithm is already implemented). – Sasha Jun 16 '20 at 19:07
  • 1
    @Sasha Pity you are not willing to try, because it will be a lot faster than your sequential search. By the way, you **REALLY DO** need encoding with Codepage 28591 to get a 1-to-1 char-to-byte mapping, so that no byte is altered during the conversion to string. If you care to look up [Get-Content](https://learn.microsoft.com/en-us/powershell/module/microsoft.powershell.management/get-content?view=powershell-7#parameters) you will find that the `-Encoding` parameter does not accept the value `28591`. That is why you need my function and can never get it with Get-Content. – Theo Jun 16 '20 at 19:16
  • @Sasha `Get-Content -Raw` reads the content of a file as UTF-16 encoded [string]. The [IndexOf()](https://learn.microsoft.com/en-us/dotnet/api/system.string.indexof) can look for a `[char]` or `[string]` inside a string. a [char](https://learn.microsoft.com/en-us/dotnet/api/system.char) represents a character as a UTF-16 code unit (--> 2 bytes). Your question is about finding the index of a `[byte[]]` array in a binary file. This is why you need to have a function like yours, or get the binary content in the form of a string with **unaltered** bytes so you can do a regex `Match` on it. – Theo Jun 16 '20 at 19:45
  • Yep, **you're right**! We do need to specify encoding, because the default UTF-8 converts non-well-formed code-units to U+FFFD characters (I didn't noticed it immediately, because such substitution was done on _both_ sides, so I still observed "correct" results in my cases, but it may produce many false positive results in general case). Still, [`Get-Content`](https://learn.microsoft.com/en-us/powershell/module/microsoft.powershell.management/get-content) does accept `-Encoding 28591`, see: "Beginning with PowerShell 6.2, the Encoding parameter also allows numeric IDs…". – Sasha Jun 16 '20 at 19:47
  • ("…As UTF-16 encoded" — actually as BOM-lessly UTF-8 encoded, but that doesn't really matter, as both are bad in this case.) – Sasha Jun 16 '20 at 19:50
  • 1
    @Sasha Aha, that is as-of PowerShell 6.2. I'm using 5.1 – Theo Jun 16 '20 at 19:50
  • Before 6.2, I'd use `[Text.Encoding]::GetEncoding(28591).GetString([IO.File]::ReadAllBytes($filepath))`. – Sasha Jun 16 '20 at 20:09
1

Just formalizing my comments and agreeing with your comment:

I dislike the idea of converting byte sequences to character sequences at all (I'd better have functionality to match byte (or other) sequences as they are), among the conversion-to-character-strings-implying solutions this seems to be one of the quickest

Performance

String manipulations are usually expensive, but re-initializing a LINQ call is apparently pretty expensive as well. I guess you may presume that the native algorithms behind PowerShell's string representation and operators such as -Like are thoroughly optimized by now.

Memory

Aside from the possible performance disadvantages, there is a memory disadvantage as well in converting each byte to a decimal string representation. In the proposed solution, each byte will take an average of 2.57 characters (depending on the number of decimal digits of each byte: (1 * 10 / 256) + (2 * 90 / 256) + (3 * 156 / 256)). Besides, you need an extra character for separating the numeric representations. In total, this makes the sequence about 3.57 times as long!
You might consider saving space by e.g. converting to hexadecimal and/or combining the separator, but that will likely result in an expensive conversion again.
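
The 2.57 and 3.57 figures can be reproduced with a short calculation. This is a sketch assuming every byte value 0..255 is equally likely; the variable names are illustrative.

```powershell
# Count the decimal digits needed to print each possible byte value (0..255),
# ASSUMING all byte values are equally likely.
$digits = 0
foreach ($b in 0..255) { $digits += "$b".Length }   # 10 one-digit, 90 two-digit, 156 three-digit values

$average       = $digits / 256      # average digits per byte
$withSeparator = $average + 1       # plus one separator character per byte

'{0:N2} digits per byte, {1:N2} characters including the separator' -f $average, $withSeparator
```

Keep in mind that .NET strings store each character as a 2-byte UTF-16 code unit, so the growth in actual memory use is roughly double the growth in length.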

Easy

Anyways, the easy way is probably still the most effective.
This comes down to the following simplified syntax:

" $Sequence " -Like "* $SubSequence *" # $True if $Sequence contains $SubSequence

(Where $Sequence and $SubSequence are binary arrays of type: [Byte[]])

Note 1: the spaces around the variables are important. This will prevent a false positive in case a 1 (or 2) digit byte representation overlaps with a 2 (or 3) digit byte representation. E.g.: 123 59 74 contains 23 59 7 in the string representation but not in the actual bytes.

Note 2: This syntax will only tell you whether $Sequence contains $SubSequence ($True or $False). It gives no clue as to where $SubSequence actually resides in $Sequence. If you need to know this, or e.g. want to replace $SubSequence with something else, refer to this answer: Methods to hex edit binary files via PowerShell.
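
The false positive described in Note 1 can be reproduced with the byte values from that note. This is a minimal sketch that assumes the default $OFS, so that an array expands to its elements separated by single spaces; $Other is an extra illustrative variable.

```powershell
[byte[]]$Sequence    = 123, 59, 74
[byte[]]$SubSequence = 23, 59, 7     # NOT contained in $Sequence as bytes
[byte[]]$Other       = 59, 74        # contained in $Sequence

# Without the surrounding spaces, "23 59 7" matches inside "123 59 74"
$naive  = "$Sequence" -like "*$SubSequence*"

# With the surrounding spaces, every number has to match as a whole
$padded = " $Sequence " -like "* $SubSequence *"
$real   = " $Sequence " -like "* $Other *"

"naive: $naive; padded: $padded; real match: $real"
```

The unpadded comparison reports a match that does not exist in the bytes, while the padded form only matches whole byte values.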

iRon
  • What do you mean by "re-initializing a LINQ call"? – Sasha Jun 18 '20 at 13:04
  • I like the idea of `" $arrayA " -Like "* $arrayB *"` and your analysis of it (despite the fact I don't understand it in full, see my previous comment), so I've upvoted it. **But!** The answers on Stack Overflow should be written in a way that they are: (1) useful for wide audience (say, for a reader who comes from google); (2) don't contextually depend on comments (which can disappear now or then). So, I recommend you to put the main idea (`" $arrayA " -Like "* $arrayB *"`) and its usage notes (Note 1, Note 2) in the beginning and its analysis — only afterwards as a subchapter. – Sasha Jun 18 '20 at 13:12
0

I've determined that the following can work as a workaround:

(Get-Content $filepath -Raw -Encoding 28591).IndexOf($fragment, [StringComparison]::Ordinal)

That is, any bytes can be successfully matched by PowerShell strings (in fact, .NET System.String) when we specify a binary-safe encoding. The ordinal comparison matters because String.IndexOf(string) is culture-sensitive by default, which is not what we want for binary data. Of course, we need to use the same encoding for both the file and the fragment, and the encoding must be really binary-safe (e.g. 1250, 1000 and 28591 fit, but the various flavours of Unicode, including the default BOM-less UTF-8, don't, because they convert any non-well-formed code unit to the same replacement character (U+FFFD)). Thanks to Theo for the clarification.

On PowerShell versions before 6.2, where -Encoding does not accept numeric code page IDs, you can use:

[System.Text.Encoding]::GetEncoding(28591).
    GetString([System.IO.File]::ReadAllBytes($filepath)).
    IndexOf($fragment, [StringComparison]::Ordinal)
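
To make the "same encoding for both the file and the fragment" requirement concrete, here is a sketch that wraps the approach in a function. Find-BytesInFile is a hypothetical name, and the ordinal comparison is my assumption about the desired behaviour (so that culture rules cannot influence the match).

```powershell
# Hypothetical helper: returns the 0-based position of $Fragment in the file, or -1.
# Pass a full path: [System.IO.File] resolves relative paths against the process
# working directory, which can differ from the current PowerShell location.
function Find-BytesInFile {
    param([byte[]]$Fragment, [string]$FilePath)

    $latin1   = [System.Text.Encoding]::GetEncoding(28591)   # 1-to-1 byte/char mapping
    $haystack = $latin1.GetString([System.IO.File]::ReadAllBytes($FilePath))
    $needle   = $latin1.GetString($Fragment)                 # same encoding as the file

    # Ordinal comparison: compare code units only, no culture-specific rules
    $haystack.IndexOf($needle, [System.StringComparison]::Ordinal)
}

# Usage: write a small binary file and search it
$tmp = [System.IO.Path]::GetTempFileName()
[System.IO.File]::WriteAllBytes($tmp, [byte[]](0, 255, 128, 10, 13, 0, 200))

$pos     = Find-BytesInFile -Fragment ([byte[]](128, 10, 13)) -FilePath $tmp   # present
$missing = Find-BytesInFile -Fragment ([byte[]](13, 10))      -FilePath $tmp   # absent

Remove-Item $tmp
"found at: $pos; missing: $missing"
```

Because each byte maps to exactly one character, the returned string index is also the byte offset in the file.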

Sadly, I haven't found a way to match sequences universally (i.e. a common method to match sequences with any item type: integer, object, etc.). I believe that it must exist in .NET (especially since a particular implementation for sequences of characters exists). Hopefully, someone will suggest it.

Sasha
  • 1
    Could you explain with more details what you are after? Looking for 0.5 MB substring from 100 MB of data doesn't seem like a common problem. Maybe there would be better ways to solve the actual problem, if you elaborate. Anyway, see question about [substring search algorithms](https://stackoverflow.com/q/3183582/503046). – vonPryz Jun 16 '20 at 20:56
  • @vonPryz, yes, a substring search algorithm is of course exactly the thing I need. The only problem is I don't want to implement it manually, I believe that such functionality **must** be a part of shell (and especially of .NET; especially that it's already implemented for character types). What I need is simple — to test whether a file is a (consecutive) part of another file (e.g. frame in video). – Sasha Jun 16 '20 at 21:04