
I have a file with a lot of text in it. This is how I read it:

file = open('labb9Text.txt', "r")

for lines in file:
    txt = str(lines)
    byteArr = bytearray(txt, "utf-8")

Now I want to write a function makeHisto(byteArr) that returns a histogram (a list of length 256) indicating how many times each number/bit-pattern (0-255) occurs in byteArr. Since I am pretty new to Python, I do not know where to start. Any suggestions on how to do this? Thanks

martineau
Dovahkiin
  • This: https://stackoverflow.com/questions/2104080/how-to-check-file-size-in-python ? – alvas Feb 10 '20 at 17:50
  • Consider having 256 buckets as a dict, and updating the value in the bucket as you see the next byte. `defaultdict` may be your friend. – 9000 Feb 10 '20 at 17:51
  • Also, the way you – 9000 Feb 10 '20 at 17:53
  • You could directly read the binary data since in your case, I think the encoding is irrelevant. Check the `'rb'` flag of Python's `open()` for that. Then, you could e.g. use NumPy's histogram function to prep the visualization, see [here](https://docs.scipy.org/doc/numpy/reference/generated/numpy.histogram.html) – FObersteiner Feb 10 '20 at 18:01
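
The bucket idea from the comments can also be sketched with a plain list instead of a dict, since the keys are just the integers 0-255. This is only a minimal sketch under that assumption, not the commenters' own code; the function name makeHisto comes from the question, and the sample input is made up for illustration.

```python
def makeHisto(byteArr):
    """Return a list of 256 counts, one per possible byte value."""
    histo = [0] * 256      # one bucket per byte value 0..255
    for b in byteArr:      # iterating over bytes/bytearray yields ints
        histo[b] += 1
    return histo


# Illustrative input; in practice the bytes could come from
# open(path, "rb").read(), as suggested in the comment above.
sample = bytearray("hello", "utf-8")
histo = makeHisto(sample)
print(histo[ord("l")])  # count for byte value 108
```

Reading the whole file in binary mode would also avoid building a separate bytearray per line, so the histogram covers the full contents rather than a single line.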

2 Answers


Try this:

import sys

import requests 
from io import StringIO

import seaborn as sns # for data visualization
sns.set()

# To just take a file from https://norvig.com/big.txt
fin = StringIO(requests.get('https://norvig.com/big.txt').content.decode('utf8'))

num_symbols, num_bytes = [], []

for line in fin:
    # Get the in-memory size of the str object in bytes
    # (includes object overhead; this is not the encoded length).
    num_bytes.append(sys.getsizeof(line))
    # Get no. of chars in string
    num_symbols.append(len(line))

# Plot the graph.
sns.distplot(num_symbols)

# Plot the other graph.
sns.set()
sns.distplot(num_bytes)

Plotting them together would most probably be more informative; try:

sns.distplot(num_symbols, label="chars")
sns.distplot(num_bytes, label="bytes")
alvas

You could use collections.Counter (see the Python 3 docs: class collections.Counter([iterable-or-mapping])) on the file contents:

>>> import collections
>>>
>>> file_name = r"C:\Windows\comsetup.log"
>>>
>>> with open(file_name, "rb") as fin:
...     text = fin.read()
...
>>> len(text)
771
>>>
>>> text
b'COM+[12:31:53]: ********************************************************************************\r\nCOM+[12:31:53]: Setup started - [DATE:12,24,2019 TIME: 12:31 pm]\r\nCOM+[12:31:53]: ********************************************************************************\r\nCOM+[12:31:53]: Start CComMig::Discover\r\nCOM+[12:31:53]: Return XML stream: <migXml xmlns=""><rules context="system"><include><objectSet></objectSet></include></rules></migXml>\r\nCOM+[12:31:53]: End CComMig::Discover - Return 0x00000000\r\nCOM+[12:31:56]: ********************************************************************************\r\nCOM+[12:31:56]: Setup (COMMIG) finished - [DATE:12,24,2019 TIME: 12:31 pm]\r\nCOM+[12:31:56]: ********************************************************************************\r\n'
>>>
>>> hist = collections.Counter(text)
>>>
>>> hist
Counter({42: 320, 58: 38, 32: 32, 49: 26, 101: 19, 50: 17, 51: 17, 77: 16, 116: 16, 67: 14, 91: 11, 93: 11, 48: 11, 109: 11, 79: 10, 115: 10, 105: 10, 43: 9, 53: 9, 13: 9, 10: 9, 114: 9, 117: 8, 110: 8, 60: 8, 62: 8, 111: 7, 99: 7, 108: 7, 83: 5, 100: 5, 69: 5, 112: 4, 68: 4, 84: 4, 44: 4, 103: 4, 34: 4, 47: 4, 97: 3, 45: 3, 73: 3, 88: 3, 120: 3, 54: 3, 65: 2, 52: 2, 57: 2, 118: 2, 82: 2, 61: 2, 98: 2, 106: 2, 76: 1, 121: 1, 40: 1, 71: 1, 41: 1, 102: 1, 104: 1})
>>>
>>> chr(42).encode()  # For testing purposes only
b'*'
>>>
>>> text.count(b"*")
320

hist is a mapping where each key is a byte value (in [0..255]) that was encountered in the text, and the corresponding value is its occurrence count. Byte values that never occur are not stored as keys.
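
Since the question asks for a list of length 256 rather than a mapping, the Counter can be expanded into one. The following is a minimal sketch, assuming the data is a bytes or bytearray object; the helper name counter_to_histogram is hypothetical and chosen only for this example.

```python
import collections


def counter_to_histogram(data):
    """Return a 256-element list of counts for a bytes-like object."""
    counts = collections.Counter(data)
    # A Counter returns 0 for missing keys, so byte values that
    # never occur in the data end up as 0 in the list.
    return [counts[value] for value in range(256)]


# Illustrative input standing in for the file contents read with "rb".
hist_list = counter_to_histogram(b"COM+[12:31:53]: ***")
print(len(hist_list))        # always 256
print(hist_list[ord("*")])   # occurrences of the '*' byte
```

The list index is the byte value, so hist_list[42] plays the same role as hist[42] in the session above.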

CristiFati