
I have a file with 7946479 records, and I want to read it line by line and insert the records into a SQLite database. My first approach was to open the file, read the records line by line, and insert them into the database at the same time; since it deals with a huge amount of data, this takes a very long time. I wanted to change this naive approach, and while searching the internet I found this question: python-csv-to-sqlite (https://stackoverflow.com/questions/5942402/python-csv-to-sqlite). There the data is in a CSV file, whereas my file is in .dat format, but I like the answer to that problem, so I am now trying to do it the way that solution does.

The approach used there is to first split the whole file into chunks and then do one database transaction per chunk, instead of writing each record one at a time.
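
As I understand it, the batched part would look roughly like the sketch below; the table weather and its columns are just placeholders I made up from my sample data, not something taken from the linked answer.

import sqlite3

conn = sqlite3.connect('data.db')
cur = conn.cursor()
cur.execute('CREATE TABLE IF NOT EXISTS weather '
            '(lat REAL, lon REAL, day INTEGER, mon INTEGER, '
            't2m REAL, rh2m REAL, sf REAL, ws REAL)')

# one chunk = a list of already-parsed rows; insert them in a single call
chunk = [
    (5, 60.0, 1, 1, 299.215, 94.737, 209.706, 5.213),
    (5, 60.25, 1, 1, 299.25, 94.728, 208.868, 5.137),
]
cur.executemany('INSERT INTO weather VALUES (?, ?, ?, ?, ?, ?, ?, ?)', chunk)
conn.commit()
conn.close()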

So I started writing code to split my file into chunks. Here is my code:

file = r'files/jan.dat'
test_file = r'random_test.txt'


def chunks(file_obj, size=10000):
    counter = 0
    file_chunks = []
    temp_chunks = []

    for line in file_obj:
        if line == '\n':
            continue
        if counter != size:
            temp_chunks.append(line)
            counter += 1
        else:
            file_chunks.append(temp_chunks)
            temp_chunks = []
            counter = 0
    file_obj.close()
    if len(temp_chunks) != 0:
        file_chunks.append(temp_chunks)

    yield file_chunks

if __name__ == '__main__':
    split_files = chunks(open(test_file))
    for chunk in split_files:
        print(len(chunk))

The output is 795, but what I wanted was to split the whole file into chunks of 10000 lines each.

I can't figure out what is going wrong here. I can't share my whole file, so for testing you can use this code to generate a file with 7946479 lines:

TEXT = 'Hello world'
FILE_LENGTH = 7946479

counter = 0
with open(r'random_test.txt', 'w') as f:
    for _ in range(FILE_LENGTH):
        f.write(f"{TEXT}\n")

This is what my original file looks like (the file format is .dat):

lat lon day mon t2m rh2m    sf  ws
5   60  1   1   299.215 94.737  209.706 5.213
5   60.25   1   1   299.25  94.728  208.868 5.137
5   60.5    1   1   299.295 94.695  207.53  5.032
5   60.75   1   1   299.353 94.623  206.18  4.945
5   61  1   1   299.417 94.522  204.907 4.833
5   61.25   1   1   299.447 94.503  204.219 4.757
5   61.5    1   1   299.448 94.525  203.933 4.68
5   61.75   1   1   299.443 94.569  204.487 4.584
5   62  1   1   299.44  94.617  204.067 4.464
ABHIJITH EA

4 Answers


An easy way to chunk the file is to use f.read(size) until there is no content left. However, this method works with a number of characters rather than a number of lines.

test_file = 'random_test.txt'


def chunks(file_name, size=10000):
    with open(file_name) as f:
        while content := f.read(size):
            yield content


if __name__ == '__main__':
    split_files = chunks(test_file)
    for chunk in split_files:
        print(len(chunk))

The last chunk takes whatever is left, here 143 characters. Note that this splits on character counts, so a chunk can end in the middle of a line.


Same Function with lines

test_file = "random_test.txt"


def chunks(file_name, size=10000):
    with open(file_name) as f:
        while content := f.readline():
            for _ in range(size - 1):
                content += f.readline()

            yield content.splitlines()


if __name__ == '__main__':
    split_files = chunks(test_file)

    for chunk in split_files:
        print(len(chunk))


The last chunk takes whatever is left, here 6479 lines.

Yohann Boniface
test_file = r'random_test.txt'

def chunks(file_obj, size=10000):
    counter, chunks = 0, []
    for line in file_obj:
        if line == '\n':
            continue
        counter += 1
        chunks.append(line)
        if counter == size:
            yield chunks
            counter, chunks = 0, []
    file_obj.close()
    if counter:
        yield chunks

if __name__ == '__main__':
    split_files = chunks(open(test_file))
    for chunk in split_files:
        print(len(chunk))

This outputs a long run of 10000s with 6479 at the end. Note that the yield keyword is really what makes this suitable here, while it was absolutely useless in the place where you used it. yield creates a lazy iterator: a new chunk is read from the file only when we request it, so we never hold the full file in memory.
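
To tie this back to the original goal, the chunks generator above can feed one executemany call per chunk, so each chunk becomes a single transaction. This is only a sketch: it assumes the real jan.dat with its 8 whitespace-separated columns and a header row, and the table name weather is a placeholder.

import sqlite3

conn = sqlite3.connect('data.db')
cur = conn.cursor()
cur.execute('CREATE TABLE IF NOT EXISTS weather '
            '(lat REAL, lon REAL, day INTEGER, mon INTEGER, '
            't2m REAL, rh2m REAL, sf REAL, ws REAL)')

with open('files/jan.dat') as f:
    next(f)  # skip the header row
    for chunk in chunks(f):
        # one executemany + commit per chunk instead of one INSERT per record
        rows = [line.split() for line in chunk]
        cur.executemany('INSERT INTO weather VALUES (?, ?, ?, ?, ?, ?, ?, ?)', rows)
        conn.commit()

conn.close()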

STerliakov

Simply read it using pandas.read_csv with the chunksize argument

import pandas as pd

chunks = pd.read_csv('jan.dat', sep=r'\s+', chunksize=1000)

for chunk in chunks:
    ...  # process each chunk here

You can also use pandas.DataFrame.to_sql to push it to the database.
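
For example, something along these lines should work; the table name weather and the database file weather.db are placeholders, not anything mandated by pandas.

import sqlite3
import pandas as pd

conn = sqlite3.connect('weather.db')

for chunk in pd.read_csv('jan.dat', sep=r'\s+', chunksize=10000):
    # append each chunk to the table; it is created on the first call
    chunk.to_sql('weather', conn, if_exists='append', index=False)

conn.commit()
conn.close()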

Vishnudev Krishnadas

As a solution to your problem of the task taking too long, I would suggest using multiprocessing instead of chunking the text (which would take just as long, only in more steps). The multiprocessing library allows several CPU cores to perform the same task in parallel, resulting in a shorter run time. Here is an example.

import multiprocessing as mp

# Step 1: Use multiprocessing.Pool() and specify number of cores to use (here I use 4).
pool = mp.Pool(4)

# Step 2: Use pool.starmap, which takes an iterable of argument tuples
results = pool.starmap(My_Function, [(Parameter1, Parameter2, Parameter3) for i in data])
    
# Step 3: Don't forget to close
pool.close()
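
A minimal runnable variant of the same idea is sketched below; parse_line is a hypothetical worker that turns one whitespace-separated record into a tuple of fields, and the pool size of 4 is arbitrary.

import multiprocessing as mp

def parse_line(line):
    # hypothetical worker: split one record into its fields
    return tuple(line.split())

if __name__ == '__main__':
    # the __main__ guard is required so child processes can be started safely
    with mp.Pool(4) as pool, open('random_test.txt') as f:
        count = 0
        # imap is lazy: lines are sent to the workers in batches of `chunksize`
        for row in pool.imap(parse_line, f, chunksize=10000):
            count += 1  # here you could collect rows and insert them in batches
    print(count)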

C.L.