
I want to split a (in real life: huge) file into multiple files specified by, say, the second column in the data. I.e., in the example below I need the files 431.csv and rr1.csv. My main idea is to open a new connection for writing whenever one is not already open, keep a record of the open connections in the dict files_dict, and then iterate through that dict and close everything at the end.

I am stuck on how to refer to these connections line by line.

In real life the number and values of these file names (the second column) are not known beforehand.

Found some inspiration here:

write multiple files at a time

python inserting variable string as file name

How can I split a text file into multiple text files using python?

Content of toy data in data_in:

123,431,t
43,rr1,3
13,rr1,43
123,rr1,4

My naive pseudo-code as of now:

files_dict = dict() #dict of file names

with open(data_in) as fi:
    for line in fi:
        x = line.split(',')[1]

        if x not in files_dict:
            fo = x + '.csv'
            files_dict[x] = fo

            '''
            open files_dict[x]
            write line to files_dict[x]

            '''
    else:
        '''
        write line to files_dict[x]
        '''

for fo in files_dict.fos:
    fo.close()
  • You could do this in pandas in a few lines. Give me a second to write up a solution. Or somebody else might since I'm at the office. – alfonso Jan 15 '19 at 21:16

3 Answers


You do have the right idea, but you should store the file objects rather than the file names in the dict, and you don't need an else block (which, incidentally, should have been aligned with the if rather than the for):

files_dict = {}

with open(data_in) as fi:
    for line in fi:
        x = line.split(',')[1]
        if x not in files_dict:
            files_dict[x] = open(x + '.csv', 'w')
        files_dict[x].write(line)

for file in files_dict.values():
    file.close()
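
Note that if anything raises an exception mid-loop, the final for loop never runs and the files are left open. As a variant, here is a sketch of the same approach using contextlib.ExitStack from the standard library, which guarantees that every registered file is closed on exit or on error:

from contextlib import ExitStack

files_dict = {}

with open(data_in) as fi, ExitStack() as stack:
    for line in fi:
        x = line.split(',')[1]
        if x not in files_dict:
            # enter_context registers the file so the stack closes it later
            files_dict[x] = stack.enter_context(open(x + '.csv', 'w'))
        files_dict[x].write(line)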
  • Oh of course! This is exactly what I was looking for. New to Python, I failed to think in terms of the file objects themselves. – user3375672 Jan 15 '19 at 22:22

Put the file objects themselves into the dictionary, not the filenames.

files_dict = {}

with open(data_in) as fi:
    for line in fi:
        x = line.split(',')[1]

        if x not in files_dict:
            fo = open(x + '.csv', "w")
            files_dict[x] = fo
        else:
            fo = files_dict[x]

        fo.write(line)  # write the whole line, not just the key

for fo in files_dict.values():
    fo.close()
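
Note that line.split(',') will mis-parse csv fields that contain quoted commas. If that can occur in the real file, here is a sketch of the same approach built on the standard csv module (still assuming the key is the second column):

import csv

files_dict = {}  # key -> open file object
writers = {}     # key -> csv.writer bound to that file

with open(data_in, newline='') as fi:
    for row in csv.reader(fi):
        key = row[1]
        if key not in writers:
            files_dict[key] = open(key + '.csv', 'w', newline='')
            writers[key] = csv.writer(files_dict[key])
        writers[key].writerow(row)

for fo in files_dict.values():
    fo.close()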

You can also use pandas, which handles a large csv nicely; then just iterate through the pandas column:

import pandas as pd

df = pd.read_csv('fun.txt', header=None)

string = "tester string"

for row in df[1]:
    fo = row + '.csv'
    f = open(fo, 'a')
    f.write(string + '\n')
    f.close()

The output is 2 files, 431.csv and rr1.csv. Contents of 431.csv:

tester string

Contents of rr1.csv:

tester string
tester string
tester string

It will append to a file whose name it has already seen, which I feel is the desired behavior based on your pseudocode. This is a good solution because it opens and closes the files as it loops through the column. That way you don't have 50 files open at the same time, which could cause trouble for your OS.
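
If opening and closing a file per row turns out to be too slow, the per-row work can be avoided entirely: DataFrame.groupby collects all rows that share a key, so each output file is written in a single call. A sketch, assuming the whole file fits in memory (read_csv's chunksize parameter would be needed otherwise):

import pandas as pd

df = pd.read_csv('fun.txt', header=None)  # columns are labeled 0, 1, 2

# one to_csv call per distinct value in the second column
for key, group in df.groupby(1):
    group.to_csv(str(key) + '.csv', header=False, index=False)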

  • Since you said your file is huge, opening too many files can cause issues. There is a ulimit on your OS for how many files can be open. This is a very simple pandas implementation. I just find it to be a very easy way to organize files like .csv and .txt because it handles all parsing and allows easy access to specific columns. – d_kennetz Jan 15 '19 at 22:17
  • If leaving them open is not an issue, you could also just wait to close the files outside of the loop. – d_kennetz Jan 15 '19 at 22:27