I have a .csv file that includes hundreds of millions of rows (yes, big data), and I want to use Python to delete the last row of it. I do know some methods that follow the read-delete-rewrite process. For example, using the pandas library: `pd.read_csv()` to read the file, `.drop()` to drop the last row, and then `.to_csv()` to overwrite/rewrite the file. This works, but it is too slow because the file includes hundreds of millions of rows ... So, is there a simple, direct method that works faster for such big data, without these traditional three steps? Thanks!
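Just to be concrete, this is the kind of three-step code I mean (the file name is a placeholder):

```python
import pandas as pd

# Read everything into memory, drop the last row, then rewrite the whole file.
df = pd.read_csv("data.csv")
df = df.drop(df.index[-1])            # drop the last row
df.to_csv("data.csv", index=False)
```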

  • Is it feasible for you to move to a different platform or language, like SQL? In all honesty, it sounds like the problem is actually the way you're storing these data points. SQL can manage these kinds of tasks and execute them within milliseconds, if not faster. I don't mean to sidetrack from your actual question, but I feel like using SQL would solve this problem and set you up for easier data retrieval and manipulation in the future. – Calc-You-Later Jun 14 '21 at 01:10
  • If you have unix tools available, I would use `sed`, as you can delete specific lines by line number; see here: https://stackoverflow.com/questions/2112469/delete-specific-line-numbers-from-a-text-file-using-sed (a sketch of this follows these comments). – Alex Jun 14 '21 at 01:13
  • No, there is no "direct simple way", read and re-write *is the simple way*. – juanpa.arrivillaga Jun 14 '21 at 01:58
  • Also, **don't use pandas for this**. Pandas is for complex, *numeric calculations* and data transformations. Reading a csv and filtering some lines is *very basic* and should just be done with the built-in `csv` module, probably more efficiently at that (see the `csv` sketch after these comments). – juanpa.arrivillaga Jun 14 '21 at 01:59
  • Thank you all for your comments. I do agree with @Calc-You-Later that I may use Python-based SQL to handle it. I was just a bit lazy to apply SQL (not a fan of it since I was a student). – QuestionStudent Jun 14 '21 at 03:19
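For reference, a minimal sketch of the `sed` approach suggested above; the file name is a placeholder. Note that GNU `sed -i` only edits "in place" from the user's point of view; internally it still rewrites the file to a temporary copy, so it isn't fundamentally faster than a read-and-rewrite:

```sh
# "$" addresses the last line of the file and "d" deletes it.
# GNU sed's -i flag rewrites the file via a temporary copy under the hood.
sed -i '$d' data.csv
```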
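And a minimal sketch of the built-in `csv` approach suggested above, streaming the file row by row and holding one row back so that the last row is never written (both file names are placeholders):

```python
import csv

# Stream the input while buffering exactly one row,
# so the final row is never written to the output.
with open("data.csv", newline="") as src, \
        open("data_trimmed.csv", "w", newline="") as dst:
    reader = csv.reader(src)
    writer = csv.writer(dst)
    prev = next(reader, None)        # hold the first row back
    for row in reader:
        writer.writerow(prev)        # prev cannot be the last row here
        prev = row
    # the buffered final row in `prev` is intentionally dropped
```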

1 Answer

I would not use Python at all. Just use Unix command-line tools. Here's an example using the `head` command to skip the last n lines. That being said, if you want to do anything more complex than skipping the last line, then you should put this file into a database, as the commenter above recommended. Doing anything meaningful with data this size is not feasible in Python; you need a database.
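Something like the following, assuming GNU coreutils; the file names are placeholders, and note that a negative count for `-n` is a GNU extension that BSD/macOS `head` doesn't support:

```sh
# Print all but the last 1 line ("-n -1"); use "-n -5" to skip the last 5, etc.
head -n -1 data.csv > data_trimmed.csv
```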

sparc_spread
  • There's nothing wrong with using Python for this instead of Unix command-line tools, particularly if you aren't on a *nix OS. But honestly, at this scale, you and the commenter are probably right; just use a database. – juanpa.arrivillaga Jun 14 '21 at 01:59
  • Point taken. Agreed with you by the way re using `csv` instead of `pandas`. – sparc_spread Jun 15 '21 at 02:52