2

I have a folder which contains around 400 txt files. The maximum size of a txt file is 2 to 2.5 MB.

I am trying to convert these files to csv with Python. My code works perfectly and converts txt to csv quickly when the txt files are small (even with more than 500 files), but when the files are a little heavier it takes quite a long time.

Of course heavy data takes longer, but the problem is that this conversion process has been running for 2 days and not even 50% is completed.

Is there any way to convert these txt files to csv quickly? I mean within a few hours. If it takes more than 2 days then I will not have enough time to analyze the data.

My code is here:

import glob
import os, os.path, glob
import numpy as np
import matplotlib.pyplot as plt
from natsort import natsorted
import pandas as pd
from matplotlib.patches import Ellipse
from matplotlib.text import OffsetFrom


from mpl_toolkits.mplot3d import Axes3D
from random import random

data_folder = "./all/"
data_folder
files = natsorted(glob.glob(data_folder + 'dump*.data'))
number_of_files = len(files)
#print(number_of_files)
#files

file_open = open("./all/dump80000.data", "r")
with open("./all/dump80000.data") as f:
  lines = f.readlines()
#removing 'ITEM:' 
s = 'ITEM: ATOMS '
lines[8] = lines[8].replace(s, '')

#getting the header names
headers = lines[8].split()

headers.append('TIMESTEP')
df = pd.DataFrame(columns=headers)

counter = 0
for total_files in range(number_of_files):
    with open(files[total_files]) as f:
        lines = f.readlines()
        total_atoms = int(lines[3])
        for i in range(total_atoms):
            row_elements = lines[9+i].split()
            row_elements.append(int(lines[1]))
            df.loc[counter] = row_elements
            counter=counter+1

    df.to_csv(r'all.csv', index = False)

Any ideas? Suggestions?

Thank you

In case you need a txt sample:

https://raw.githubusercontent.com/Laudarisd/dump46000.data

or

https://raw.githubusercontent.com/Laudarisd/test/main/dump46000.data

Codeholic
  • aren't you opening the same file twice? – Thecave3 Dec 29 '21 at 02:21
  • One of the *first* things you should always do when working on performance issues is to profile your code. See [How can you profile a Python script?](https://stackoverflow.com/questions/582336/how-can-you-profile-a-python-script) – martineau Dec 29 '21 at 02:23

4 Answers

1

How about using a simple line-by-line read? I suspect readlines and/or pd.DataFrame are consuming most of the time. The following seems to be fast enough for me.

import glob
import time

start = time.time()

data_folder = "./all/"
files = glob.glob(data_folder + 'dump*.data')

# get header from one of the files
with open('all/dump46000.data', 'r') as f:
    for _ in range(8):
        next(f) # skip first 8 lines
    header = ','.join(f.readline().split()[2:]) + '\n'

with open('all.csv', 'w') as g:
    g.write(header) # write the header only once

for file in files:
    with open(file, 'r') as f, open('all.csv', 'a') as g: # note the 'a'
        for _ in range(9):
            next(f) # skip the first 9 lines of each file
        for line in f:
            g.write(line.rstrip().replace(' ', ',') + '\n')

print(time.time() - start)

# id,type,x,y,z,vx,vy,vz,fx,fy,fz
# 201,1,0.00933075,-0.195667,1.53332,-0.000170702,-0.000265168,0.000185569,0.00852572,-0.00882728,-0.0344813
# 623,1,-0.101572,-0.159675,1.52102,-0.000125008,-0.000129469,6.1561e-05,0.0143586,-0.0020444,-0.0400259
# 851,1,-0.0654623,-0.176443,1.52014,-0.00017815,-0.000224676,0.000329338,0.0101743,0.00116504,-0.0344114
# 159,1,-0.0268728,-0.186269,1.51979,-0.000262947,-0.000386994,0.000254515,0.00961213,-0.00640215,-0.0397847
j1-lee
  • Thanks for your answer. I have more than 400 txt files in a folder. Using with open doesn't loop over all the files, I guess. This is only for one file, right? – Codeholic Dec 29 '21 at 01:41
  • Of course you can modify it to be embedded in a for loop. – j1-lee Dec 29 '21 at 01:42
  • Let me try, thanks. – Codeholic Dec 29 '21 at 01:45
  • @Codeholic Check the updated answer. This will generate, for example, `./all/dump46000.data.csv`. – j1-lee Dec 29 '21 at 01:45
  • Thanks, I was doing the same... Let me write my csv and I will come back to vote on your answer. – Codeholic Dec 29 '21 at 01:47
  • Thanks, I was doing the same... Let me write my csv. It seems that it's not writing all the txt to csv. It generates a csv of 2 MB. Did I miss anything? – Codeholic Dec 29 '21 at 02:02
  • It generates one file for each `.data` file. Did you want to concatenate all the results in a single output file? – j1-lee Dec 29 '21 at 02:06
  • For your example input, `dump46000.data`, the code seems to work fine; I copied it 400 times, and converted them all in less than 10 seconds, even on my slow Celeron computer. Each csv file is of size 1.8M. The number of rows in the result matches the number of atoms, in this case `19582`. So I don't understand what you mean by "It generates csv of 2mb." – j1-lee Dec 29 '21 at 02:07
  • Oh, you mean each csv for each txt. I didn't concatenate. Apologies. – Codeholic Dec 29 '21 at 02:11
  • Oh if you want to concatenate them all, I can modify the code. – j1-lee Dec 29 '21 at 02:12
  • Would be nice, Could you help me? I want to create a single csv which includes all txt information with single header. – Codeholic Dec 29 '21 at 02:12
  • @Codeholic Sure, check the updated answer :) – j1-lee Dec 29 '21 at 02:18
  • Thanks a lot, it seems it did convert to one csv. I have to confirm my txt data and csv data to be sure. Thanks a lot. For now your answer is acceptable. – Codeholic Dec 29 '21 at 02:30
0

Taking a quick glance at your code, it seems you're taking the following approach to convert a file:

  1. Open the file
  2. Read the entire file into a buffer
  3. Process the buffer

However, you can make some small adjustments to your code:

  1. Open the file
  2. Read one line
  3. Process the line
  4. Continue until the file is done

Basically, take an iterative approach instead of reading the whole file all at once. You can then speed things up further by processing several files concurrently, for example with a process pool (plain asyncio won't gain much here, since the per-file work is blocking file I/O and string processing); see the sketch below.
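As a rough sketch of that idea (using a process pool from concurrent.futures rather than asyncio; the convert_one helper and the per-file output naming are assumptions, not code from the question):

import csv
import glob
from concurrent.futures import ProcessPoolExecutor

def convert_one(path):
    # Hypothetical helper: stream one dump file line by line into its own CSV.
    with open(path) as f, open(path + '.csv', 'w', newline='') as out:
        writer = csv.writer(out)
        for _ in range(8):
            next(f)                                   # skip the preamble
        header = next(f).replace('ITEM: ATOMS ', '')  # 9th line holds the column names
        writer.writerow(header.split())
        for line in f:                                # process one line at a time
            writer.writerow(line.split())
    return path

if __name__ == '__main__':
    files = glob.glob('./all/dump*.data')
    with ProcessPoolExecutor() as pool:               # one worker per CPU core by default
        for done in pool.map(convert_one, files):
            print('converted', done)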

0

It's hard to give precise help without knowing exactly what data you want to extract from those files, but at first glance you should definitely use one of pandas' built-in file reading methods, which are typically many times faster than appending rows to a DataFrame one by one. Assuming you wish to skip the first 9 rows, you could do something like:

headers = ["a", "b", ...]
pd.read_csv("./all/dump80000.data", skiprows=9, sep=" ", header=None, names=headers)

If this is still not fast enough, you can parallelize your code across files, since most of the processing is just loading data into memory; see the sketch below.
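For instance, a sketch of that combination, pairing read_csv with a process pool and using the column names from the sample output shown earlier in the thread (the ./all/dump*.data pattern and the all.csv output name follow the question; treat the details as a starting point, not a drop-in solution):

import glob
from concurrent.futures import ProcessPoolExecutor

import pandas as pd

HEADERS = ["id", "type", "x", "y", "z", "vx", "vy", "vz", "fx", "fy", "fz"]

def load_one(path):
    # Each worker parses one dump file with pandas' reader.
    # sep=r"\s+" splits on any run of whitespace.
    return pd.read_csv(path, skiprows=9, sep=r"\s+", header=None, names=HEADERS)

if __name__ == "__main__":
    files = glob.glob("./all/dump*.data")
    with ProcessPoolExecutor() as pool:
        frames = list(pool.map(load_one, files))  # one file per worker process
    pd.concat(frames).to_csv("all.csv", index=False)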

ApplePie
  • Hi, thanks a lot. I mentioned a sample data link in my question. You are right that I want to convert my data to csv after the 9th row. The 8th row includes the data header in all the txt files; after that, every row contains values for those headers. – Codeholic Dec 29 '21 at 01:36
0

I recommend breaking the problem down into distinct steps for a few files, then once you're sure you understand how to correctly code each step independently, you can think about combining them:

  1. convert all TXT to CSVs
  2. process each CSV doing what you need

Here's how to do step 1:

import csv

out_f =  open('output.csv', 'w', newline='')
writer = csv.writer(out_f)
in_f = open('input.txt')

# Consume first 8 lines you don't want
for _ in range(8):
    next(in_f)

# Get and fix-up your header
header = next(in_f).replace('ITEM: ATOMS ', '')
writer.writerow(header.split())

# Read the rest of the file line-by-line, splitting by space, which will make a row that the CSV writer can write
for line in in_f:
    row = line.split()
    writer.writerow(row)

in_f.close()
out_f.close()

When I ran that against your sample .data file, I got:

id,type,x,y,z,vx,vy,vz,fx,fy,fz
201,1,0.00933075,-0.195667,1.53332,-0.000170702,-0.000265168,0.000185569,0.00852572,-0.00882728,-0.0344813
623,1,-0.101572,-0.159675,1.52102,-0.000125008,-0.000129469,6.1561e-05,0.0143586,-0.0020444,-0.0400259
851,1,-0.0654623,-0.176443,1.52014,-0.00017815,-0.000224676,0.000329338,0.0101743,0.00116504,-0.0344114
...

Do that for all 400 TXT files, then write another script to process the resulting CSVs.
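One way to run step 1 over every file is to wrap the snippet above in a function and loop over the glob results (a sketch; the convert name and the per-file .csv output naming are mine, while the ./all/dump*.data pattern comes from the question):

import csv
import glob
import os

def convert(txt_path, csv_path):
    # Same steps as above: skip 8 lines, fix up the header, then stream the rows.
    with open(txt_path) as in_f, open(csv_path, 'w', newline='') as out_f:
        writer = csv.writer(out_f)
        for _ in range(8):
            next(in_f)
        header = next(in_f).replace('ITEM: ATOMS ', '')
        writer.writerow(header.split())
        for line in in_f:
            writer.writerow(line.split())

for txt_path in glob.glob('./all/dump*.data'):
    csv_path = os.path.splitext(txt_path)[0] + '.csv'  # e.g. ./all/dump80000.csv
    convert(txt_path, csv_path)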

I'm on an M1 MacBook Air with a good, fast SSD. Converting that one .data file takes less than a tenth of a second. Unless you've got a really slow disk, I cannot see both steps taking more than an hour.

Zach Young