I would like to use Python to delete the header and the first data row of a huge CSV file (3 GB) with good performance.

import csv
import pandas as pd

def remove2rows(csv_file):
    data = pd.read_csv(csv_file)
    data = data.iloc[1:]
    data.to_csv(csv_file, header=None, index=False)

if __name__ == "__main__":
    filename = "data.csv"  # placeholder path; the original post does not define it
    remove2rows(filename)

This script works but is slow, probably because it reads the whole file into memory and then writes every row from row 3 to the end back out as CSV.

Is there any way to improve the performance?

sotech
  • Not Python but very likely much faster: https://stackoverflow.com/questions/9633114/unix-script-to-remove-the-first-line-of-a-csv-file – petezurich Dec 18 '19 at 10:51
  • Hi, welcome to SO! Stack overflow is here to help you with code that generally doesn't work. If you're looking for code to be reviewed, and want to know about improvements take a look over at https://codereview.stackexchange.com/ instead! – Remy Dec 18 '19 at 11:01
  • @petezurich Yes, I found that page as well, and I am trying to call the "sed" command from Python: `import subprocess def testing(filename): cmd = "sed -i '' 1d %s" %filename subprocess.call(cmd, shell=True)` Error message: 'sed' is not recognized as an internal or external command, operable program or batch file. – sotech Dec 19 '19 at 02:57
  • Have you made sure that sed is installed and that you can execute it from your shell? – petezurich Dec 19 '19 at 06:28

2 Answers

Note that the only way to "remove lines from a file" IS to read the whole file (though not necessarily all at once xD) and write back selected lines to a new file. That's how files work.

But you'd certainly save time by not using pandas here - pandas is a tool for doing computations on tabular data, not a file utility. Using the stdlib's csv module, or even more simply plain file operations (if you are 101% sure your csv doesn't contain embedded newlines), would probably be more efficient, at least wrt/ memory use, and probably wrt/ raw performance.
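As a rough illustration of the stdlib approach described above, here is a minimal sketch that streams the file through the csv module, so a quoted field with an embedded newline still counts as one record. The function name `drop_first_rows` and its parameters are made up for this example, and the UTF-8 default is an assumption. Note that `csv.writer` re-serializes every row, so quoting and line endings in the result may differ from the original file.

```python
import csv
import os
import tempfile


def drop_first_rows(path, n_rows=2, encoding="utf-8"):
    """Rewrite `path` in place without its first `n_rows` CSV records.

    Hypothetical helper for illustration; the encoding default is an assumption.
    """
    directory = os.path.dirname(os.path.abspath(path))
    # Write to a temporary file in the same directory so that the final
    # os.replace() stays on one filesystem.
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w", newline="", encoding=encoding) as dst, \
                open(path, newline="", encoding=encoding) as src:
            reader = csv.reader(src)
            writer = csv.writer(dst)
            for _ in range(n_rows):
                # A quoted multi-line field is consumed as a single record.
                next(reader, None)
            # Rows are pulled from the reader one at a time, so the whole
            # file is never held in memory.
            writer.writerows(reader)
        os.replace(tmp_path, path)
    except BaseException:
        # Clean up the temporary file if anything went wrong.
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise
```

If the file is known to contain no embedded newlines, skipping the csv parsing entirely and copying raw bytes is likely to be faster still.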

bruno desthuilliers

Question: Delete first two rows of a huge csv file

This example does the following:
Find the offset of the second newline, move the file position to it, and copy from there to the end of the file.

Report back if you see any performance improvement!



import io, shutil

DATA = b"""First line to be skipped
Second line to be skipped
Data Line 1
Data Line 2
Data Line 3
"""

def main():    
    # with open('in_filename', 'rb') as in_fh, open('out_filename', 'wb') as out_fh:
    with io.BytesIO(DATA) as in_fh, io.BytesIO() as out_fh:

        # Find the offset of the second newline.
        # Assumes it lies within the first 70 bytes and that
        # neither skipped line contains an embedded newline.
        # Adjust the read size to your needs.
        buffer = in_fh.read(70)

        offset = 0
        for _ in range(2):
            offset = buffer.find(b'\n', offset) + 1

        print('Change the file position to: {}'.format(offset))
        in_fh.seek(offset)

        # Copy to the end of the file
        shutil.copyfileobj(in_fh, out_fh)

        # This is only for demo printing the result
        print(out_fh.getvalue())

if __name__ == "__main__":
    main()

Output:

Change the file position to: 51
b'Data Line 1\nData Line 2\nData Line 3\n'

Tested with Python: 3.5
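For real files on disk, the fixed 70-byte read can be avoided by letting `readline()` consume the lines to skip, whatever their length. The following is only a sketch of that variation: the function name `skip_lines_copy`, its parameters, and the 1 MiB chunk size are assumptions for illustration, and it still assumes no embedded newlines in the skipped lines.

```python
import shutil


def skip_lines_copy(in_path, out_path, n_lines=2, chunk_size=1024 * 1024):
    """Copy in_path to out_path, leaving out the first n_lines lines.

    Hypothetical helper; returns the byte offset where copying started.
    """
    with open(in_path, "rb") as in_fh, open(out_path, "wb") as out_fh:
        # readline() advances the file position past each skipped line,
        # so no guess about the line length is needed.
        for _ in range(n_lines):
            in_fh.readline()
        offset = in_fh.tell()

        # Copy the remainder in large binary chunks, without parsing it.
        shutil.copyfileobj(in_fh, out_fh, chunk_size)
    return offset
```

The output is written to a separate file; replacing the original afterwards (for example with `os.replace`) is left to the caller.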

stovfl