
I am working with a large text file containing the decimal digits of pi, in the format shown below. Note that the header consists only of numbers and contains no strings.

Header format: Number_of_sequences Total_Pi_Digits File_Version_Number

550 10000 5

*Pi Sequence Part 1
1415926535897932384
*Pi Sequence Part 2
6264338327950288419
*Pi Sequence Part 3
1693993751058209749

I need to make a sliding window that crops the file using three arguments (window_size, step_size, and lastwindow_start). lastwindow_start is the index where the last window starts.

The number of files is determined by dividing Total_Pi_Digits by window_size.

If the file had 99 Total_Pi_Digits, a window_size of 10, and a step_size of zero, there would be a total of 10 windows, since 99//10=9 gives 9 full windows and 99%10 leaves 9 digits in window 10.

I guess lastwindow_start should be 90 for this example. I am not sure that I need lastwindow_start at all.
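A quick way to sanity-check this kind of count is the minimal sketch below. It assumes the windows do not overlap (each one starts window_size digits after the previous one) and that a final partial window is kept. `count_windows` and `window_starts` are hypothetical helpers, not part of my code further down.

```python
def count_windows(total_digits, window_size):
    """Count non-overlapping windows, keeping a final partial window."""
    full, remainder = divmod(total_digits, window_size)
    return full + (1 if remainder else 0)


def window_starts(total_digits, window_size):
    """Start index of every non-overlapping window."""
    return list(range(0, total_digits, window_size))


print(count_windows(99, 10))      # 10
print(window_starts(99, 10)[-1])  # 90
```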

For each window, a file will be created with the name PiSubsection#, where # is the window number.

Every file should start with a new header in the same format: Number_of_sequences Total_Pi_Digits File_Version_Number.

Number_of_sequences and Total_Pi_Digits will change based on window_size and step_size, but File_Version_Number must not change.

My problem is that my sliding window algorithm does not account for a step_size of 0, and it does not produce the right number of files. So far it produces twice as many files as expected, and I am not sure why.

Additionally, I am not sure I even understand the math for the number of windows in a sliding window algorithm.

How do I fix my sliding window algorithm to accept a step_size of 0 and produce the right number of output files?

    inputFileName = "example.txt"

    import shlex

    def sliding_window(windows_size, step_size, lastwindow_start):
        for i in xrange(0, lastwindow_start, step_size):
            yield (i, i + windows_size)

    def PiCrop(windows_size, step_size):
        with open(inputFileName, 'r') as input:
            first_line = shlex.split(input.readline().strip())
            PiNumber = int(first_line[1])

            lastwindow_start = PiNumber-(PiNumber%windows_size)
            flags = [False for i in range(lastwindow_start)]

            first_line[1] = str(windows_size * int(first_line[0]))

            first_line = " ".join(first_line)

            for line in input:
                if line.startswith(first_line[0]):
                    pass
                elif line.startswith('*'):
                    Indiv = line
                else:
                    for counter, window in enumerate(sliding_window(windows_size, step_size, lastwindow_start)):
                        newline = line[window[0]:window[1]]

                        with open('PiSection{}.txt'.format(counter), 'a') as output:
                            if (flags[counter] == False):
                                flags[counter] = True
                                output.write(first_line + '\n')
                            output.write(Indiv)
                            output.write(newline + '\n')
    Is your concept of a sliding window the same as is described here? http://stackoverflow.com/questions/8269916/what-is-sliding-window-algorithm-examples Also, seems to me that the step size must be at least 1 or the algorithm would stay on the same index? – David Dec 13 '15 at 01:47
  • Yes. That is what I was confused about. The problem is that it does not create the right number of files. –  Dec 13 '15 at 02:03

2 Answers


The sample code below offers an alternative way of doing it that avoids needing to do the calculations. I've taken the view that you have no issue either loading the digits file or actually writing the 'window' files afterwards, so my code assumes they are loaded and produces an array of windows of digits ready to write.

From that result you can simply iterate over the now-derived windows and output the files as before, or you can dip into the nested data and get individual windows for processing as you need.

Example output is below. Let me know if anything needs more detail ...

import pprint

# Separated just for easy comparison with the output.
pi_digits = '1415926535' + '8979323846' + '2643383279' + '5028841916' + '9399375105' + '8209749'
total_digits = len(pi_digits)

def splitIntoWindows(digits, window_size):
    result = []
    count = 0
    window = -1
    for digit in digits:
        index = count % window_size
        if index == 0:
            window += 1
            result.append([])
        result[window] += digit
        count += 1
    return result

windows = splitIntoWindows(pi_digits, 10)

print("Split into {} window(s):".format(len(windows)))
pprint.pprint(windows)

Sample output:

Split into 6 window(s):
[['1', '4', '1', '5', '9', '2', '6', '5', '3', '5'],
 ['8', '9', '7', '9', '3', '2', '3', '8', '4', '6'],
 ['2', '6', '4', '3', '3', '8', '3', '2', '7', '9'],
 ['5', '0', '2', '8', '8', '4', '1', '9', '1', '6'],
 ['9', '3', '9', '9', '3', '7', '5', '1', '0', '5'],
 ['8', '2', '0', '9', '7', '4', '9']]
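The same split can also be written with slicing, which leaves room for a step size. The sketch below is only one possible reading of the question: it assumes that a step_size of 0 means "no overlap" and therefore treats it as equal to window_size. Some mapping of that kind is needed because `range` raises `ValueError` when its step is 0. The name `sliding_windows` is hypothetical.

```python
def sliding_windows(digits, window_size, step_size=0):
    """Yield (start, chunk) pairs; the last chunk may be shorter."""
    if window_size <= 0:
        raise ValueError("window_size must be positive")
    # Assumption: a step of 0 means the windows do not overlap.
    step = step_size if step_size > 0 else window_size
    for start in range(0, len(digits), step):
        yield start, digits[start:start + window_size]
        # Stop as soon as a window reaches the end of the digits.
        if start + window_size >= len(digits):
            break


print(list(sliding_windows('1415926535', 4)))
# [(0, '1415'), (4, '9265'), (8, '35')]
print(list(sliding_windows('1415926535', 4, 2)))
# [(0, '1415'), (2, '1592'), (4, '9265'), (6, '6535')]
```

Each start index doubles as the window's offset, so a separate lastwindow_start argument is not needed in this version.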

EDIT

To avoid too much assumption on my part, here's a snippet to parse the loaded digits file:

# Assumed these are the contents loaded in:
file_contents = '''
550 10000 5

*Pi Sequence Part 1
1415926535897932384
*Pi Sequence Part 2
6264338327950288419
*Pi Sequence Part 3
1693993751058209749
'''

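pi_digits = ''
line_num = 0
for line in file_contents.split('\n'):
    line = line.strip()
    # Skip blank lines and the '*Pi Sequence Part N' marker lines.
    if line and not line.startswith('*'):
        line_num += 1
        # The first remaining line is the header; the rest are digits.
        if line_num > 1:
            pi_digits += line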

This should leave pi_digits ready to use, so you can just replace the declaration of pi_digits in my code above with this instead.
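To go from the windows to the output files, one option is sketched below. It assumes, following the question, that every output file starts with a header of the form Number_of_sequences Total_Pi_Digits File_Version_Number and is named PiSubsection#.txt. It also assumes that each file holds a single sequence, which the question does not confirm. `write_window_files` and its `out_dir` parameter are hypothetical names.

```python
import os


def write_window_files(windows, version, out_dir="."):
    """Write one file per window and return the paths created."""
    paths = []
    for number, window in enumerate(windows, start=1):
        chunk = "".join(window)  # accepts a list of digits or a string
        path = os.path.join(out_dir, "PiSubsection{}.txt".format(number))
        with open(path, "w") as output:
            # Assumed header: one sequence, its digit count, same version.
            output.write("1 {} {}\n".format(len(chunk), version))
            output.write("*Pi Sequence Part 1\n")
            output.write(chunk + "\n")
        paths.append(path)
    return paths
```

With the `windows` list built earlier and the version number 5 from the sample header, `write_window_files(windows, 5)` would create PiSubsection1.txt through PiSubsection6.txt in the current directory.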

K Cartlidge

The solution is to read the file into a list and then slice that list with a sliding-window generator to create each of the smaller files.

# Python 2 code (xrange, str.translate(None, ...)).
inputFileName = "sample.txt"

import linecache

def sliding_window(window_size, step_size, lastwindow_start, total_pi_digits):
    # Yield (start, end) index pairs; step_size must be at least 1.
    for i in xrange(0, lastwindow_start, step_size):
        yield (i, i + window_size)
    # The final window covers whatever remains after lastwindow_start.
    yield (lastwindow_start, total_pi_digits)

def PiCrop(window_size, step_size):
    # Read the header and the first data line to get the total and the offset.
    with open(inputFileName, 'r') as f:
        first_line = f.readline().split()
        second_line = f.readline().split()

    total_pi_digits = int(first_line[0])
    lastwindow_start = total_pi_digits - (total_pi_digits % window_size)
    offset = int(round(float(second_line[0])))

    # Store the data lines in a list so that each window is a simple slice.
    with open(inputFileName, 'r') as f:
        f.readline()  # skip the header line
        data = [line.strip().split(',') for line in f.readlines()]

    windows = sliding_window(window_size, step_size, lastwindow_start, total_pi_digits)
    for counter, window in enumerate(windows):
        chunk = data[window[0]:window[1]]

        print(window)

        # Mode 'w' creates each file once, so the header needs no flag.
        with open('PiCrop_{}.txt'.format(counter), 'w') as output:
            headerline = float(linecache.getline(inputFileName, window[1]+1)) - offset
            output.write(str(window_size) + " " + "{0:.4f}".format(headerline) + " " + 'L' + '\n')

            for item in chunk:
                newline = "{0:.4f}".format(float(str(item).translate(None, "[]'")) - offset)
                output.write(newline + '\n')