
I have a file with 7946479 records, and I want to read it line by line and insert the records into a SQLite database. My first approach was to open the file, read the records line by line, and insert them into the database at the same time; since it deals with a huge amount of data, this takes a very long time. I wanted to change this naive approach, and while searching the internet I found this question: python-csv-to-sqlite (https://stackoverflow.com/questions/5942402/python-csv-to-sqlite). There the data is in a CSV file, whereas my file is in .dat format, but I like the answer to that problem, so I am now trying to do it the way that solution does.

The approach used there is to first split the whole file into chunks and then do one database transaction per chunk, instead of writing each record one at a time.
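
As I understand it, the batched part would look roughly like the sketch below; the table weather and its columns are just placeholders I made up from my sample data, not something taken from the linked answer.

import sqlite3

conn = sqlite3.connect('data.db')
cur = conn.cursor()
cur.execute('CREATE TABLE IF NOT EXISTS weather '
            '(lat REAL, lon REAL, day INTEGER, mon INTEGER, '
            't2m REAL, rh2m REAL, sf REAL, ws REAL)')

# one chunk = a list of already-parsed rows; insert them in a single call
chunk = [
    (5, 60.0, 1, 1, 299.215, 94.737, 209.706, 5.213),
    (5, 60.25, 1, 1, 299.25, 94.728, 208.868, 5.137),
]
cur.executemany('INSERT INTO weather VALUES (?, ?, ?, ?, ?, ?, ?, ?)', chunk)
conn.commit()
conn.close()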

So I started writing code to split my file into chunks. Here is my code:

file = r'files/jan.dat'
test_file = r'random_test.txt'


def chunks(file_obj, size=10000):
    counter = 0
    file_chunks = []
    temp_chunks = []

    for line in file_obj:
        if line == '\n':
            continue
        if counter != size:
            temp_chunks.append(line)
            counter += 1
        else:
            file_chunks.append(temp_chunks)
            temp_chunks = []
            counter = 0
    file_obj.close()
    if len(temp_chunks) != 0:
        file_chunks.append(temp_chunks)

    yield file_chunks

if __name__ == '__main__':
    split_files = chunks(open(test_file))
    for chunk in split_files:
        print(len(chunk))

The output is 795, but what I wanted was to split the whole file into chunks of 10000 lines each.

I can't figure out what is going wrong here. I can't share my whole file, so for testing you can use this code to generate a file with 7946479 lines:

TEXT = 'Hello world'
FILE_LENGTH = 7946479

counter = 0
with open(r'random_test.txt', 'w') as f:
    for _ in range(FILE_LENGTH):
        f.write(f"{TEXT}\n")

This is what my original file looks like (the file format is .dat):

lat lon day mon t2m rh2m    sf  ws
5   60  1   1   299.215 94.737  209.706 5.213
5   60.25   1   1   299.25  94.728  208.868 5.137
5   60.5    1   1   299.295 94.695  207.53  5.032
5   60.75   1   1   299.353 94.623  206.18  4.945
5   61  1   1   299.417 94.522  204.907 4.833
5   61.25   1   1   299.447 94.503  204.219 4.757
5   61.5    1   1   299.448 94.525  203.933 4.68
5   61.75   1   1   299.443 94.569  204.487 4.584
5   62  1   1   299.44  94.617  204.067 4.464
ABHIJITH EA

4 Answers


An easy way to chunk the file is to use f.read(size) until there is no content left. However, this method works with a number of characters rather than a number of lines.

test_file = 'random_test.txt'


def chunks(file_name, size=10000):
    with open(file_name) as f:
        while content := f.read(size):
            yield content


if __name__ == '__main__':
    split_files = chunks(test_file)
    for chunk in split_files:
        print(len(chunk))

The last chunk takes whatever is left, here 143 characters. Note that this splits on character counts, so a chunk can end in the middle of a line.


Same Function with lines

test_file = "random_test.txt"


def chunks(file_name, size=10000):
    with open(file_name) as f:
        while content := f.readline():
            for _ in range(size - 1):
                content += f.readline()

            yield content.splitlines()


if __name__ == '__main__':
    split_files = chunks(test_file)

    for chunk in split_files:
        print(len(chunk))


The last chunk takes whatever is left, here 6479 lines.

Yohann Boniface
test_file = r'random_test.txt'

def chunks(file_obj, size=10000):
    counter, chunks = 0, []
    for line in file_obj:
        if line == '\n':
            continue
        counter += 1
        chunks.append(line)
        if counter == size:
            yield chunks
            counter, chunks = 0, []
    file_obj.close()
    if counter:
        yield chunks

if __name__ == '__main__':
    split_files = chunks(open(test_file))
    for chunk in split_files:
        print(len(chunk))

This outputs a long run of 10000s with 6479 at the end. Note that the yield keyword is really what makes this suitable here, while it was absolutely useless in the place where you used it. yield creates a lazy iterator: a new chunk is read from the file only when we request it, so we never hold the full file in memory.
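
To tie this back to the original goal, the chunks generator above can feed one executemany call per chunk, so each chunk becomes a single transaction. This is only a sketch: it assumes the real jan.dat with its 8 whitespace-separated columns and a header row, and the table name weather is a placeholder.

import sqlite3

conn = sqlite3.connect('data.db')
cur = conn.cursor()
cur.execute('CREATE TABLE IF NOT EXISTS weather '
            '(lat REAL, lon REAL, day INTEGER, mon INTEGER, '
            't2m REAL, rh2m REAL, sf REAL, ws REAL)')

with open('files/jan.dat') as f:
    next(f)  # skip the header row
    for chunk in chunks(f):
        # one executemany + commit per chunk instead of one INSERT per record
        rows = [line.split() for line in chunk]
        cur.executemany('INSERT INTO weather VALUES (?, ?, ?, ?, ?, ?, ?, ?)', rows)
        conn.commit()

conn.close()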

STerliakov

Simply read it using pandas.read_csv with the chunksize argument

import pandas as pd

chunks = pd.read_csv('jan.dat', sep=r'\s+', chunksize=1000)

for chunk in chunks:
    ...  # process each chunk here

You can also use pandas.DataFrame.to_sql to push it to the database.
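
For example, something along these lines should work; the table name weather and the database file weather.db are placeholders, not anything mandated by pandas.

import sqlite3
import pandas as pd

conn = sqlite3.connect('weather.db')

for chunk in pd.read_csv('jan.dat', sep=r'\s+', chunksize=10000):
    # append each chunk to the table; it is created on the first call
    chunk.to_sql('weather', conn, if_exists='append', index=False)

conn.commit()
conn.close()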

Vishnudev Krishnadas

As a solution to your problem of the task taking too long, I would suggest using multiprocessing instead of chunking the text (which would take just as long, only in more steps). The multiprocessing library allows several CPU cores to perform the same task in parallel, resulting in a shorter run time. Here is an example.

import multiprocessing as mp

# Step 1: Use multiprocessing.Pool() and specify number of cores to use (here I use 4).
pool = mp.Pool(4)

# Step 2: Use pool.starmap, which takes an iterable of argument tuples
results = pool.starmap(My_Function, [(Parameter1, Parameter2, Parameter3) for i in data])
    
# Step 3: Don't forget to close
pool.close()
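
A minimal runnable variant of the same idea is sketched below; parse_line is a hypothetical worker that turns one whitespace-separated record into a tuple of fields, and the pool size of 4 is arbitrary.

import multiprocessing as mp

def parse_line(line):
    # hypothetical worker: split one record into its fields
    return tuple(line.split())

if __name__ == '__main__':
    # the __main__ guard is required so child processes can be started safely
    with mp.Pool(4) as pool, open('random_test.txt') as f:
        count = 0
        # imap is lazy: lines are sent to the workers in batches of `chunksize`
        for row in pool.imap(parse_line, f, chunksize=10000):
            count += 1  # here you could collect rows and insert them in batches
    print(count)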

C.L.