
Okay so this is actually a problem I have been able to fix, but I still do not understand why the problem existed in the first place.

I have been using tshark on network traffic with the intention of creating a txt or csv file containing key information I can use for machine learning. At first glance the file looked perfectly fine and exactly how I imagined. However, in Python I noticed some strange initial characters, and after applying split() I was suddenly working with what looks like bytecode.

My PowerShell script initially looked like this:

$src = "G:\...\train_data\"
$dst = $src+"tsharked\"
Write-Output $dst

Get-ChildItem $src -Filter *.pcap | 
Foreach-Object {
    $content = Get-Content $_.FullName
    $filename=$_.BaseName
    tshark -r $_.FullName -T fields -E separator="," -E quote=n -e ip.src -e ip.dst -e tcp.len -e frame.time_relative -e frame.time_delta > $dst$filename.txt
}

Now I try to read this file in my Jupyter notebook:

directory = "G://.../train_data/tsharked/"
file = open(directory+"example.txt", "r")
for line in file.readlines():
    print(line)
    words = line.split(",")
    print(words)
    break

The result looks like this

ÿþ134.169.109.51,134.169.109.25,543,0.000000000,0.000000000

['ÿþ1\x003\x004\x00.\x001\x006\x009\x00.\x001\x000\x009\x00.\x005\x001\x00', '\x001\x003\x004\x00.\x001\x006\x009\x00.\x001\x000\x009\x00.\x002\x005\x00', '\x005\x004\x003\x00', '\x000\x00.\x000\x000\x000\x000\x000\x000\x000\x000\x000\x00', '\x000\x00.\x000\x000\x000\x000\x000\x000\x000\x000\x000\x00\n']

When I opened the text file in Editor, the special characters ÿþ did not appear. This is the first time I have seen them. What do they even mean here? In any case, I managed to fix this simply by removing the output redirection in my PowerShell script.

$src = "G:\...\train_data\"
$dst = $src+"tsharked\"
Write-Output $dst

Get-ChildItem $src -Filter *.pcap | 
Foreach-Object {
    $content = Get-Content $_.FullName
    $filename=$_.BaseName
    $out = tshark -r $_.FullName -T fields -E separator="," -E quote=n -e ip.src -e ip.dst -e tcp.len -e frame.time_relative -e frame.time_delta
    Set-Content -Path $dst$filename.txt -Value $out
}

And this is where I am asking myself the question of how it is possible that the output redirection in powershell has managed to write some kind of byte output? In my understanding this is simply a redirection of the console output, hence the name. How can this be anything but a String?

ThomasFG

1 Answer

  • As of PowerShell 7.2, output from external programs is invariably decoded as text before further processing, which means that raw (byte) output can neither be passed on via | nor captured with >. See this answer for details.

  • PowerShell's > redirection operator is effectively an alias of Out-File, and its default character encoding therefore applies.

In Windows PowerShell, Out-File defaults to "Unicode" encoding, i.e. UTF-16LE:

  • This encoding uses a BOM (byte-order mark, whose bytes, if interpreted individually as ANSI (Windows-1252) bytes, render as ÿþ), and it represents most characters as two-byte sequences.[1] For most characters in the Windows-1252 character set (which itself is a superset of ASCII), the second byte in each sequence is a NUL (0x0 byte). This is what you're seeing.
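
The effect described above can be reproduced in Python without PowerShell. The following is a minimal sketch: the sample line is shortened from the question's output, and the variable names are made up for illustration. It builds the bytes a UTF-16LE file with a BOM would contain, then decodes them once as Windows-1252 (the wrong encoding) and once as UTF-16.

```python
import codecs

# Sample line shortened from the question's output (illustrative only).
line = "134.169.109.51,134.169.109.25,543\n"

# Bytes as stored in a UTF-16LE file with a BOM: FF FE, then two bytes per character.
raw = codecs.BOM_UTF16_LE + line.encode("utf-16-le")

# Decoding those bytes as Windows-1252 turns the BOM into "ÿþ"
# and leaves a NUL character after every ASCII character.
misread = raw.decode("cp1252")
fields = misread.split(",")
print(repr(fields[0]))

# Decoding with the matching encoding consumes the BOM and restores the text.
decoded = raw.decode("utf-16")
print(repr(decoded))
```

The strings are still ordinary Python strings in both cases; the first one merely contains the BOM characters and NULs because the bytes were decoded with an encoding that does not match the file.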

Fortunately, in PowerShell (Core) 7+, all file-processing cmdlets now consistently default to (BOM-less) UTF-8.

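To use a different encoding, either call Out-File explicitly and use its -Encoding parameter, or use Set-Content, as you have done, which is generally preferable for the sake of performance when dealing with data that already is text. In Windows PowerShell, Set-Content defaults to the system's active ANSI code page, which is why your rewritten files read correctly; it, too, accepts an -Encoding parameter.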
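
If regenerating the files is not an option, the UTF-16 files can also be read as they are by telling Python which encoding to use. This is a sketch under assumptions: the temporary file stands in for a file produced by > in Windows PowerShell, and the path and sample line are hypothetical.

```python
import os
import tempfile

# Sample line as it appears in the question's output.
sample = "134.169.109.51,134.169.109.25,543,0.000000000,0.000000000\n"

with tempfile.TemporaryDirectory() as tmp:
    # Hypothetical stand-in for a file written by ">" in Windows PowerShell.
    path = os.path.join(tmp, "example.txt")

    # The "utf-16" codec writes a BOM followed by two bytes per character.
    with open(path, "w", encoding="utf-16", newline="") as f:
        f.write(sample)

    # Passing the encoding explicitly makes Python consume the BOM
    # and decode each two-byte sequence into one character.
    with open(path, "r", encoding="utf-16") as f:
        words = f.readline().rstrip("\n").split(",")

print(words)
```

Without the encoding argument, open() falls back to the platform's default encoding, which on Windows is typically an ANSI code page such as Windows-1252.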


[1] At least two bytes are needed per character; for characters outside the so-called BMP (Basic Multilingual Plane), a pair of two-byte sequences is needed.
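
The byte counts from the footnote can be checked with a short Python sketch; the two sample characters are chosen purely for illustration.

```python
# A character inside the BMP needs one two-byte sequence in UTF-16.
bmp_char = "A"
bmp_len = len(bmp_char.encode("utf-16-le"))

# A character outside the BMP (here U+1F600) needs a pair of two-byte sequences.
non_bmp_char = "\U0001F600"
non_bmp_len = len(non_bmp_char.encode("utf-16-le"))

print(bmp_len, non_bmp_len)
```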

mklement0