I wrote a Python script that reads file offsets and file names from a list and divides one large file into multiple files. For the splitting I use a shell script that takes these names and offsets as input and creates the output files with the head command. I use Python to pass the input to the shell script. This works fine on Windows 7 and on other Linux systems. But when I tried the same thing on the ESX 6.5 hypervisor, I realized I cannot use the same shell script there, because head does not behave the way it does on the other systems.

list_info = ['IdleChk_1_E1.txt', '749', 'IdleChk_2_E1.txt', '749', 'reg_fifo_E1.txt', '5922', 'igu_fifo_E1.txt', '161', 'protection_override_E1.txt', '1904', 'fw_asserts_E1.txt', '708', 'McpTrace.txt', '15578', 'phy_dump.txt', '129', 'GrcDumpE1.bin', '3629656']

The even-indexed elements are file names and the odd-indexed elements are sizes (in bytes).

Here is the command I use to send the input to the shell script:

Process_three=subprocess.Popen("./read.sh %s %s %s %s %s %s %s %s %s %s %s %s %s %s %s %s %s %s %s" \
                             %(''.join(map(str, list_info[1:2])), ''.join(map(str, list_info[0:1])),\
                               ''.join(map(str, list_info[3:4])), ''.join(map(str, list_info[2:3])),\
                               ''.join(map(str, list_info[5:6])), ''.join(map(str, list_info[4:5])),\
                               ''.join(map(str, list_info[7:8])), ''.join(map(str, list_info[6:7])),\
                               ''.join(map(str, list_info[9:10])), ''.join(map(str, list_info[8:9])),\
                               ''.join(map(str, list_info[11:12])), ''.join(map(str, list_info[10:11])),\
                               ''.join(map(str, list_info[13:14])), ''.join(map(str, list_info[12:13])),\
                               ''.join(map(str, list_info[15:16])), ''.join(map(str, list_info[14:15])),\
                               ''.join(map(str, list_info[17:18])), ''.join(map(str, list_info[16:17])),\
                               file_name), stdout=subprocess.PIPE, shell=True)
(temp, error) = Process_three.communicate()

Here is my shell script.

if [ "$#" -eq 19 ];
then
{
    head -c $1 > $2
    head -c $3 > $4
    head -c $5 > $6
    head -c $7 > $8
    head -c $9 > ${10}
    head -c ${11} > ${12}
    head -c ${13} > ${14}
    head -c ${15} > ${16}
    head -c ${17} > ${18}
} < ${19}
fi

On ESX only the first head command produces output.

Is there another way to split the file? I know there is the split command, but that splits the file into equal-sized pieces. I need files of dynamic sizes. I was hoping I could do the splitting from Python itself. By the way, I am new to Python.

  • Do you wish to split the file by line or by chunk size? Is the file text or binary? Or does it matter? – aquil.abdullah Apr 17 '17 at 16:28
  • I want to split the file by chunk size. The file contains both text and binary data. – Shminderjit Singh Apr 17 '17 at 16:34
  • I recommend this link: https://www.safaribooksonline.com/library/view/programming-python-second/0596000855/ch04s02.html and this link on stackoverflow: http://stackoverflow.com/questions/8096614/split-large-files-using-python – aquil.abdullah Apr 17 '17 at 16:36
  • Can you edit your question and add a sample of `list_info`? Without that it is hard to imagine what you want. This is certainly something that can be done in Python without using `head` and `sh`! But you have definitely come up with an inventive solution. – Roland Smith Apr 17 '17 at 17:11
  • @RolandSmith I have added a sample of the list. At first I was thinking of using Python only, but the shell script itself was very easy to implement. Now I am thinking of dropping the shell part, as there will be corner cases in the future where the number of arguments is different. – Shminderjit Singh Apr 18 '17 at 05:59
  • @aquil.abdullah Thanks for the links, but in my case each split file has a different size, and these sizes I simply can't predict. They are dynamic and unpredictable. – Shminderjit Singh Apr 18 '17 at 06:03

2 Answers

First, I would suggest converting your list into a list of 2-tuples, using integers for the numbers instead of strings. It is easier to use that way. I'm using a list instead of a dict because a list has an order and a dictionary doesn't.

fragments = [('IdleChk_1_E1.txt', 749), 
             ('IdleChk_2_E1.txt', 749),
             ('reg_fifo_E1.txt', 5922),
             ('igu_fifo_E1.txt', 161),
             ('protection_override_E1.txt', 1904),
             ('fw_asserts_E1.txt', 708),
             ('McpTrace.txt', 15578),
             ('phy_dump.txt', 129),
             ('GrcDumpE1.bin', 3629656)]
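If your data starts out as the flat list from the question, a small sketch like this (the names `flat` and `fragments` are mine) converts it; the slicing relies on the even/odd layout described in the question:

```python
# Even indices hold file names, odd indices hold sizes as strings.
flat = ['IdleChk_1_E1.txt', '749', 'reg_fifo_E1.txt', '5922']
fragments = [(name, int(size)) for name, size in zip(flat[::2], flat[1::2])]
print(fragments)  # [('IdleChk_1_E1.txt', 749), ('reg_fifo_E1.txt', 5922)]
```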

Then we open the file in binary mode (I'm using Python 3 here), read the required amount of data and write it to the output files.

with open('inputfile', 'rb') as inf:
    for fn, count in fragments:
        with open(fn, 'wb') as outf:
            outf.write(inf.read(count))

It would be a good idea to check that the sum of all fragment sizes is not greater than the size of the input file. Alternatively, you could use -1 as the size of the last fragment; that makes read return all remaining data.
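A sketch of such a check using `os.path.getsize` (the fragment names, sizes, and the temporary stand-in for the real input file are all made up for the example):

```python
import os
import tempfile

fragments = [('part1.bin', 10), ('part2.bin', 20)]
total = sum(count for _, count in fragments)

# Stand-in for the real input file: 30 bytes of dummy data.
with tempfile.NamedTemporaryFile(delete=False) as f:
    f.write(b'x' * 30)
    input_path = f.name

# Refuse to split if the fragments would need more data than exists.
if total > os.path.getsize(input_path):
    raise ValueError("fragments need %d bytes but the file is smaller" % total)
os.unlink(input_path)
```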

Roland Smith
It's fairly obvious from your attempted solution that you are new to Python, but you've actually made surprising progress in making use of the subprocess library, so I am sure you'll do better as time goes by. Often a problem appears difficult simply because you aren't aware of all the features of the available tools. In this case it seems you are using head because you know it can be forced to do what you want, but I'm sure you'll agree it's not a comfortable solution.

It's difficult to deal with any procedure that takes nineteen arguments - the commands become rather difficult to understand, and it's much easier to make errors in writing them. A data-driven approach, where you describe in a text file how you want your files to be split, is likely to be more tractable. Then you can write a program that reads that description and makes use of it to split the file. Since Python can read and write files quite easily, you should then find there's no need to use shell scripting at all, which will make your solution much more portable.

If I've understood your shell script correctly, each head command takes a certain number of bytes from the file named in the nineteenth(!) argument and writes them out to a nominated file. So you might use a data file layout that contains lines of the form

N filename

where N is the number of bytes to be written to the named file. To allow me to test this I created the following task_description.txt:

10 file1.txt
20 file2.txt
30 file3.txt

Like your program (if I've got it right) any bytes after the sixty specified will be ignored. So now I can write a program, so15.py, that reads the task description and processes a data file, named in its first command-line argument, accordingly:

import sys
in_file = sys.argv[1]
with open("task_description.txt") as td, open(in_file, "rb") as inf:
    for line in td:
        n, file_name = line.split()
        with open(file_name, "wb") as out_file:
            out_file.write(inf.read(int(n)))
        print("Wrote", n, "bytes to", file_name)

I then ran this using a data file that had over 60 bytes in it - the Misc/NEWS file from the Python distribution - using the command

python so15.py /Users/sholden/Projects/Python/cpython/Misc/NEWS

It gave the output

Wrote 10 bytes to file1.txt
Wrote 20 bytes to file2.txt
Wrote 30 bytes to file3.txt

As a check I then ran the command

wc file*.txt

with the following result

   0       1      10 file1.txt
   2       4      20 file2.txt
   2       6      30 file3.txt
   4      11      60 total

Hopefully you will be able to adapt this to solve your problem fairly easily.
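To connect this back to the list in your question, here is a sketch (my assumption about your workflow) that writes a task_description.txt from such a flat name/size list; the `chunks` sample is shortened:

```python
# Shortened sample of the flat list: names at even indices, sizes at odd ones.
chunks = ['IdleChk_1_E1.txt', '749', 'McpTrace.txt', '15578']

# Each line of the task description is "<size> <filename>".
with open('task_description.txt', 'w') as td:
    for name, size in zip(chunks[::2], chunks[1::2]):
        td.write('%s %s\n' % (size, name))
```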

holdenweb