I need a quick and easy way to know how many dlls are 32-bit and how many are 64-bit in a given directory. I was about to write a PowerShell script when I thought of a much simpler solution. I've shown below that my idea can work but I need a little regex help to make it work properly.

It has been demonstrated that a dll file can be opened in Notepad to reveal the bitness (32 or 64) simply by checking the character after the first "PE". The letters "L" and "d" indicate 32-bit and 64-bit respectively (reference). Notepad++ or a hex editor will show more accurately that there are actually 2 null characters between the "PE" and the other character, as shown in the image below copied from Notepad++.

Screenshots of dlls opened in Notepad++

Unfortunately, some of my directories contain hundreds of dlls, so it's not practical to open them one at a time with Notepad or any other utility. There are, however, powerful "grep-capable" file search utilities that can search a directory for files containing a specified search string. Moreover, some of these can do regular expression (regex) searches. Since I know the unique strings that differentiate 32-bit and 64-bit dlls (shown above), such a file search utility should be able to quickly inventory the types of dlls in any directory. The best such file search utility in my opinion is grepWin, which can be downloaded and installed for free.

My first attempt was the regex search string ".PE("\x00")*" which can be broken down as follows.

Breakdown of the regex string ".PE("\x00")*"

The image below shows results of a search done using grepWin and the search string ".PE("\x00")*" for a specified directory that had 276 dll files in it. It shows that 276 of the 276 dlls found contained "PE" followed by multiple null characters. It also shows that actually thousands of matches were found. This is because the regex search continued after the first match and found many more matches in larger files that inevitably appear "randomly".

Screenshot of grepWin search results using ".PE("\x00")*"

The table below shows search results from regex strings "PE.{2}L" and "PE.{2}d" proposed by O-O-O. These search strings find all the files but unfortunately some of the dll files are being counted twice because the sum of the 32-bit and 64-bit dlls exceeds the total number of dll files in the directory.

files found containing "PE.{2}L" and "PE.{2}d"

The screenshots below of the search results using "PE.{2}L" and "PE.{2}d" show that the number of matches exceeds the number of files found, meaning that the regex searches are going beyond the first match.

grepWin details using "PE.{2}L" and "PE.{2}d"
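The over-counting described above can be reproduced outside grepWin. The following is a minimal Python sketch, assuming a synthetic byte string as a stand-in for a real dll (the `data` contents are invented for illustration): it contains a header-like "PE\0\0L" near the start and a coincidental "PE..d" sequence further in, so the same "file" is hit by both search strings.

```python
import re

# Synthetic stand-in for a dll (assumption, not a real file): a header-like
# signature near the start plus a look-alike byte sequence later in the body.
data = (
    b"MZ" + b"\x90" * 62        # fake 64-byte DOS header
    + b"PE\x00\x00\x4c\x01"     # header signature followed by "L" (0x4c)
    + b"\x00" * 100
    + b"xxPE\x11\x22dxx"        # coincidental "PE", two bytes, then "d"
)

# re.DOTALL lets "." match any byte, including newline bytes.
pattern_32 = re.compile(rb"PE.{2}L", re.DOTALL)
pattern_64 = re.compile(rb"PE.{2}d", re.DOTALL)

print(len(pattern_32.findall(data)))  # 1 (the header)
print(len(pattern_64.findall(data)))  # 1 (a false positive in the body)

# Looking only at the first occurrence classifies the file exactly once.
first = re.search(rb"PE.{2}[Ld]", data, re.DOTALL)
print(first.group()[-1:])  # b'L'
```

A tool that reports every match per file behaves like `findall` here, which is why the 32-bit and 64-bit totals can add up to more than the number of files.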

So I only need to know how to modify these regex search strings to stop searching 3 characters after the first "PE" is found. I know this can be done using the ".*?" modifiers but I haven't been able to get it to work. Here is my question.

• How can these search strings be modified to stop reading 3 characters after the first "PE" is found?

Any regex search strings can be verified by searching any directory of dlls with grepWin. To be correct, the search strings must produce exactly as many matches as files, unlike the examples shown above. This will verify that the search stopped after the first match was found.

skinnedknuckles
  • Do you have WSL installed? – jhnc Jan 30 '23 at 04:35
  • Please provide at least one example of dll you are checking, and the expected output. – O-O-O Jan 30 '23 at 08:42
  • O-O-O It's counterintuitive but generally speaking, the system32 folder holds 64-bit dlls and the SysWOW64 folder holds 32-bit dlls (as you know). The file C:\Windows\system32\quartz.dll has 6 matches using "PE.{2}d" (should be 1 match) and 3 matches using "PE.{2}L" (should be 0 matches). The file C:\Windows\SysWOW64\quartz.dll has 6 matches using "PE.{2}d" (should be 0 matches) and 1 match using "PE.{2}L" (1 match is correct). – skinnedknuckles Jan 30 '23 at 15:20

1 Answer

This can't be true:

  • The regex .PE("\x00")* would search for:
    1. any character (Why include it at all? To avoid matching right at the start of the file?)

    2. the character P

    3. the character E

    4. the group of:

      1. the character "
      2. the character corresponding to the byte value 00
      3. the character "

      ...repeated, as per *, any number of times from zero to unlimited (Why not exactly 2?)

  • Wouldn't it be better to search for PE\x00\x00? Unless grepWin comes with its own flavor of regular expressions where quotation marks in groups have a special meaning. But I highly doubt that.
  • The regexes PE.{2}L and PE.{2}d are phrased in a way nobody would normally write them. Why not write PE..L straight away?

From a technical point of view

We can further restrict a regular expression so that it produces fewer false positives and does not ignore things we should also check (it helps to know what a Portable Executable's layout looks like):

  • Each executable starts with a DOS header, which is always 64 bytes long and almost always starts with MZ (in rare/historical cases also ZM or NE, but not for our case).

  • The NT header always starts with PE\0\0 (or in hexadecimal 50 45 00 00, or in regex PE\x00\x00), which is then followed by either \x64\x86 (for 64 bit) or \x4c\x01 (for 32 bit). This header can start much later, but we can safely assume to find it within the first 2048 bytes of the file (most likely after 240 bytes already).

    Also, 18 bytes later we most likely have the bytes \x0b\x01 or \x0b\x02 (or in rare cases \x07\x01).
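As an alternative to pattern matching, the layout above can be read directly. The sketch below is an illustration, not part of the original answer: it assumes the standard PE convention that the DOS header stores the NT header's offset as a little-endian 32-bit value at byte 0x3C (the `e_lfanew` field), and the helper names `pe_machine` and `fake_pe` are hypothetical.

```python
import struct

# Machine values as they appear after "PE\0\0" (little-endian on disk).
MACHINE_NAMES = {0x014C: "32-bit (x86)", 0x8664: "64-bit (x64)"}

def pe_machine(data: bytes):
    """Return the Machine value of a PE image, or None if it is not one."""
    if len(data) < 64 or data[:2] != b"MZ":
        return None
    # Offset of the NT header, stored at 0x3C in the DOS header (assumption
    # taken from the PE format, not from the text above).
    (nt_offset,) = struct.unpack_from("<I", data, 0x3C)
    if data[nt_offset:nt_offset + 4] != b"PE\x00\x00":
        return None
    if len(data) < nt_offset + 6:
        return None
    (machine,) = struct.unpack_from("<H", data, nt_offset + 4)
    return machine

def fake_pe(machine: int, nt_offset: int = 0x80) -> bytes:
    """Build a tiny synthetic header for demonstration (hypothetical helper)."""
    dos = bytearray(nt_offset)
    dos[:2] = b"MZ"
    struct.pack_into("<I", dos, 0x3C, nt_offset)
    return bytes(dos) + b"PE\x00\x00" + struct.pack("<H", machine) + bytes(18)

print(MACHINE_NAMES.get(pe_machine(fake_pe(0x8664))))  # 64-bit (x64)
print(MACHINE_NAMES.get(pe_machine(fake_pe(0x014C))))  # 32-bit (x86)
```

Because only one fixed location is inspected, stray "PE" sequences elsewhere in the file cannot cause a second hit.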

The better regex

  • For x64 (64 bit) search for ^MZ.{62,2046}PE\x00\x00\x64\x86.{18}\x0b[\x01\x02] and
  • for x86 (32 bit) search for ^MZ.{62,2046}PE\x00\x00\x4c\x01.{18}\x0b[\x01\x02].
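For anyone wanting to try these two patterns outside a search tool, here is a minimal Python sketch. It assumes Python's `re` module on bytes with `re.DOTALL` (so `.` also matches newline bytes), and the `sample` and `classify` helpers are hypothetical names that build and test synthetic headers rather than real dlls.

```python
import re

# The two patterns from above, applied to raw bytes.
X64 = re.compile(rb"^MZ.{62,2046}PE\x00\x00\x64\x86.{18}\x0b[\x01\x02]", re.DOTALL)
X86 = re.compile(rb"^MZ.{62,2046}PE\x00\x00\x4c\x01.{18}\x0b[\x01\x02]", re.DOTALL)

def sample(machine: bytes, magic: bytes) -> bytes:
    """Synthetic header: MZ, padding, PE signature, machine, 18 bytes, magic."""
    return b"MZ" + bytes(126) + b"PE\x00\x00" + machine + bytes(18) + magic

def classify(data: bytes):
    """Return "64-bit", "32-bit", or None when neither pattern matches."""
    if X64.match(data):
        return "64-bit"
    if X86.match(data):
        return "32-bit"
    return None

print(classify(sample(b"\x64\x86", b"\x0b\x02")))  # 64-bit
print(classify(sample(b"\x4c\x01", b"\x0b\x01")))  # 32-bit
print(classify(b"plain text, not an executable"))  # None
```

Since `match` only tries the pattern at the start of the data, each input yields at most one hit per pattern.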

If your target software crashes (despite advertising regex support, as grepWin does) then

  • either omit matching the DOS header entirely (removing ^MZ.{62,2046})
  • or try reducing the repetition to a smaller one, e.g. {62,280}.

Explanation:

  1. starting at the beginning of the file (actually only the start of a "line")

  2. characters M and Z (Mark Zbikowski)

  3. any character at least 62 times, but at most 2046 times (a text editor like Notepad++ might complain that our regex is too complex, which is why we also define a maximum)

  4. characters P and E (Portable Executable)

  5. bytes 00 00

  6. the CPU architecture:

    • bytes \x64\x86 for 64 bit (AMD), or
    • bytes \x4c\x01 for 32 bit (Intel 386 or later).

    Don't rely on the visible characters only (d and L), because then you ignore half of the value and risk more false positives.

  7. any character for exactly 18 times

  8. byte 0b

  9. either byte 01 or 02

Successfully tested

  • with Notepad++ 8.4.8 x64 (make sure to tick ". matches newline")
  • on C:\Windows\System32\quartz.dll
  • using Windows 7 x64 (so the DLL should be 64 bit):

Notepad++ regex matching DOS and NT headers

The big advantage here is that this regex most likely matches only once instead of multiple times, especially in DLLs. However, since executables have no "end" mark, they can carry data in any format afterwards. Regardless of the intention (good = self-extracting archives, bad = viruses), there's hardly a way to exclude those - if we're lucky, our ^ helps us.
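One way to reduce the influence of trailing data when scripting this is to look only at the beginning of each file. The sketch below is a rough illustration under stated assumptions: the `count_bitness` function and its `head_size` default of 4096 bytes are hypothetical choices (the text above only suggests the NT header sits within the first 2048 bytes), and the pattern is the DOS-header-free variant with the two machine values combined by alternation.

```python
import re
from collections import Counter
from pathlib import Path

# NT header signature, machine value (captured), 18 bytes, optional header magic.
HEADER = re.compile(rb"PE\x00\x00(\x64\x86|\x4c\x01).{18}\x0b[\x01\x02]", re.DOTALL)
LABELS = {b"\x64\x86": "64-bit", b"\x4c\x01": "32-bit"}

def count_bitness(directory, head_size=4096):
    """Count dlls per bitness, inspecting only the first head_size bytes."""
    counts = Counter()
    # Note: "*.dll" is case-sensitive on case-sensitive file systems.
    for path in Path(directory).glob("*.dll"):
        with open(path, "rb") as fh:
            head = fh.read(head_size)
        # search() stops at the first match, so each file is counted once.
        match = HEADER.search(head)
        counts[LABELS[match.group(1)] if match else "unknown"] += 1
    return counts
```

Calling `count_bitness(r"C:\Windows\System32")` would return a `Counter` keyed by "64-bit", "32-bit" and "unknown"; the totals always add up to the number of dll files scanned, since every file lands in exactly one bucket.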

AmigoJack
  • This seems like a more technically sound version of what I was trying to do. Unfortunately I can't get it to work in Notepad++ (with regex enabled and . matches newline checked). I tested ^MZ.{62,2046}PE\x00\x00\x64\x86.{18}\x0b[\x01\x02] using C:\Windows\system32\quartz.dll and tested ^MZ.{62,2046}PE\x00\x00\x4c\x01.{18}\x0b[\x01\x02] using C:\Windows\SysWOW64\quartz.dll. Zero hits on both. – skinnedknuckles Jan 30 '23 at 15:37
  • I wonder what you're doing differently. Also you're able to use [formatting in comments](https://stackoverflow.com/editing-help#comment-formatting), too. – AmigoJack Jan 30 '23 at 16:37
  • I'll learn formatting soon. I don't know what I was doing but your regex search strings work for me now in Notepad++. However, my plan is to use them in grepWin. I tested them in grepWin and of 276 dlls, the 32-bit search string detected 220 dlls but the 64-bit search string found none which is wrong because I know that at least System.Data.dll is a 64-bit library. I'm not saying your search strings are wrong but I'm curious why the second one doesn't seem to work. Screenshots of my tests are here https://drive.google.com/file/d/1HNqsilHeHIVi_TNHVCSW_Q-Z5lMT18Nm/view?usp=sharing – skinnedknuckles Jan 30 '23 at 18:31
  • You should tick "_Treat files as binary_". – AmigoJack Jan 30 '23 at 19:05
  • Yeah, when I do that it crashes so I guess I'll submit that as an issue to the grepWin author. – skinnedknuckles Jan 30 '23 at 19:25
  • I mistakenly thought this was about regex and counting files properly, not about using a particular piece of software. _grepWin_ apparently isn't stable enough to deal with longer repetitions - try omitting `^MZ.{62,2046}` or reducing it to `{62,280}`. – AmigoJack Jan 30 '23 at 20:13
  • Bingo. The search string PE\x00\x00\x64\x86.{18}\x0b[\x01\x02] finds 56 files (same as 276 - 220)! And the search string PE\x00\x00\x4c\x01.{18}\x0b[\x01\x02] finds 220 files just as before. If you edit your answer I'll mark it as correct and upvote. – skinnedknuckles Jan 30 '23 at 20:32
  • I think it would be helpful to add **PE\x00\x00.{20}\x0b[\x01\x02]** as a regex search string that finds both 32-bit and 64-bit dlls. This would be useful to distinguish them from dlls not containing the NT-header "PE\0\0". – skinnedknuckles Feb 01 '23 at 15:25
  • In that case `PE\x00\x00\(x64\x86|\x4c\x01).{18}\x0b[\x01\x02]` would fit more. [Regex 101: Alternation](https://www.regular-expressions.info/alternation.html). – AmigoJack Feb 01 '23 at 16:33
  • Yes, that would be a more accurate search. (but the third backslash needs to be inside the open-parenthesis that follows it. In other words **PE\x00\x00\\(x64\x86|\x4c\x01).{18}\x0b[\x01\x02]** should be **PE\x00\x00(\x64\x86|\x4c\x01).{18}\x0b[\x01\x02]** ) unless I'm mistaken. – skinnedknuckles Feb 01 '23 at 16:57
  • No, you're not mistaken - that was a typo by me. `\(` would mean to literally search for an opening bracket, and the `)` would then be unpaired syntax wise, making it an invalid regex. – AmigoJack Feb 01 '23 at 17:58