Okay, so this is a problem I have already managed to fix, but I still do not understand why it occurred in the first place.
I have been running tshark on network traffic captures to create a txt or csv file containing key information that I can use for machine learning. At first glance the file looked perfectly fine and exactly how I had imagined it. However, in Python I noticed some strange initial characters, and after calling split on a line I suddenly seemed to be working with raw bytes instead of text.
My PowerShell script initially looked like this:
$src = "G:\...\train_data\"
$dst = $src+"tsharked\"
Write-Output $dst
Get-ChildItem $src -Filter *.pcap |
Foreach-Object {
    $content = Get-Content $_.FullName
    $filename=$_.BaseName
    tshark -r $_.FullName -T fields -E separator="," -E quote=n -e ip.src -e ip.dst -e tcp.len -e frame.time_relative -e frame.time_delta > $dst$filename.txt
}
I then tried to read this file in my Jupyter notebook:
directory = "G://.../train_data/tsharked/"
file = open(directory+"example.txt", "r")
for line in file.readlines():
    print(line)
    words = line.split(",")
    print(words)
    break
The result looks like this:
ÿþ134.169.109.51,134.169.109.25,543,0.000000000,0.000000000
['ÿþ1\x003\x004\x00.\x001\x006\x009\x00.\x001\x000\x009\x00.\x005\x001\x00', '\x001\x003\x004\x00.\x001\x006\x009\x00.\x001\x000\x009\x00.\x002\x005\x00', '\x005\x004\x003\x00', '\x000\x00.\x000\x000\x000\x000\x000\x000\x000\x000\x000\x00', '\x000\x00.\x000\x000\x000\x000\x000\x000\x000\x000\x000\x00\n']
When I opened the text file in Editor, the special characters ÿþ did not appear. This is the first time I have seen them. What do they even mean here? In any case, the only way I managed to fix this was by removing the output redirection from my PowerShell script:
$src = "G:\...\train_data\"
$dst = $src+"tsharked\"
Write-Output $dst
Get-ChildItem $src -Filter *.pcap |
Foreach-Object {
    $content = Get-Content $_.FullName
    $filename=$_.BaseName
    $out = tshark -r $_.FullName -T fields -E separator="," -E quote=n -e ip.src -e ip.dst -e tcp.len -e frame.time_relative -e frame.time_delta
    Set-Content -Path $dst$filename.txt -Value $out
}
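One way to compare the files produced by the two script versions is to look at their first few bytes rather than at the decoded text. The following is only a minimal sketch: the helper name `sniff_encoding` and the `utf-8` fallback are my own assumptions, not something tshark or PowerShell provides, and a file without a byte order mark could still be in some other encoding.

```python
import codecs

# Known byte order marks. The UTF-32 marks come first because the
# UTF-32 LE mark begins with the same two bytes as the UTF-16 LE mark.
BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32"),
    (codecs.BOM_UTF32_BE, "utf-32"),
    (codecs.BOM_UTF8, "utf-8-sig"),
    (codecs.BOM_UTF16_LE, "utf-16"),
    (codecs.BOM_UTF16_BE, "utf-16"),
]


def sniff_encoding(head: bytes, default: str = "utf-8") -> str:
    """Return a codec name based on a leading byte order mark.

    `default` is an assumed fallback for files that carry no mark.
    """
    for bom, name in BOMS:
        if head.startswith(bom):
            return name
    return default


def sniff_file(path: str) -> str:
    """Read the first four bytes of `path` in binary mode and sniff them."""
    with open(path, "rb") as fh:
        return sniff_encoding(fh.read(4))
```

Calling `sniff_file(directory + "example.txt")` on the redirected file and on the `Set-Content` file should show whether the two differ in their leading bytes.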
This is where I ask myself how the output redirection in PowerShell managed to write some kind of byte output. In my understanding, it simply redirects the console output to a file, hence the name. How can the result be anything but a string?
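For what it is worth, the symptom can be reproduced in pure Python without tshark or PowerShell. The sketch below assumes that the redirected file is UTF-16 little-endian text with a leading byte order mark (which is what `>` writes by default in Windows PowerShell 5.1) and that `open(..., "r")` fell back to the Windows code page cp1252. Both are assumptions about my setup, and the function names are made up for illustration.

```python
import codecs

# One line of the tshark output, taken from the result shown above.
SAMPLE = "134.169.109.51,134.169.109.25,543,0.000000000,0.000000000\n"


def as_redirected_bytes(text: str) -> bytes:
    """Encode text the way the redirected file is assumed to be stored:
    a UTF-16 LE byte order mark (FF FE) followed by two bytes per character."""
    return codecs.BOM_UTF16_LE + text.encode("utf-16-le")


def read_with_wrong_codec(raw: bytes) -> list[str]:
    """Decode as cp1252: FF FE become 'ÿþ' and every second byte becomes '\\x00'."""
    return raw.decode("cp1252").split(",")


def read_with_utf16(raw: bytes) -> list[str]:
    """Decode as UTF-16: the byte order mark is consumed and the text is clean."""
    return raw.decode("utf-16").strip().split(",")


raw = as_redirected_bytes(SAMPLE)
print(read_with_wrong_codec(raw))
print(read_with_utf16(raw))
```

If the assumptions hold, the original file should also read cleanly with `open(directory + "example.txt", "r", encoding="utf-16")`, so the redirection did write a string, just in an encoding Python was not told about.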