I have put numerous CSV files in a folder and would like to first skip a certain row (e.g. the 10th row), and then keep one row out of every five.
I can do the first step, but I have no idea how to do the second one.

Thanks.

    import pandas as pd
    import csv, os


    # Loop through every file in the current working directory.
    for csvFilename in os.listdir('path'):
        if not csvFilename.endswith('.csv'):
            continue
        # Now let's read the dataframe
        # total row number
        total_line = len(open('path' + csvFilename).readlines())
        # put the first and last to a list
        line_list = [total_line] + [1]
        df = pd.read_csv('path' + csvFilename, skiprows=line_list)
        new_file_name = csvFilename

        # And output
        df.to_csv('path' + new_file_name, index=False)

Update: the following code works.

    import numpy as np
    import pandas as pd
    import csv, os

    # Loop through every file in the current working directory.
    for csvFilename in os.listdir('path'):
        if not csvFilename.endswith('.csv'):
            continue
        # Now let's read the dataframe
        total_line = len(open('path' + csvFilename).readlines())
        skip = np.arange(total_line)
        # skip 5 rows
        skip = np.delete(skip, np.arange(0, total_line, 5))
        # skip the certain row you would like, e.g. 10
        skip = np.append(skip, 10)
        df = pd.read_csv('path' + csvFilename, skiprows=skip)

        new_file_name = '2' + csvFilename
        # And output
        df.to_csv('path' + new_file_name, index=False)

Neil
  • Does this answer your question? [Select every nth row as a Pandas DataFrame without reading the entire file](https://stackoverflow.com/questions/53812094/select-every-nth-row-as-a-pandas-dataframe-without-reading-the-entire-file) – Shaido Apr 29 '20 at 09:11
  • You can [edit] the question if you want to add something, or if you have an answer you can add that (it's fine to answer your own question). If the question I linked answered your question, you can accept the duplicate. :) – Shaido Apr 29 '20 at 09:37
  • Thank you for your help. I have updated my code, however, there are still some problems. – Neil Apr 29 '20 at 09:40
  • No problems. `skip` contains the rows you want to skip so you need to remove the lines `np.delete(skip, total_line-1, 0)` and `np.delete(skip, 1, 0)`. For the last one, you should probably start from 1: `np.delete(skip, np.arange(1, total_line, 5))`. For the last line, you need to make sure it is in the `skip` list or you can use the `skipfooter` parameter in `read_csv`. – Shaido Apr 29 '20 at 09:47
  • Thanks. How about if skipping a certain row? e.g. the fifth row? – Neil Apr 29 '20 at 09:56
  • For that you still have to rely on `skiprows` as in the linked question / your updated answer. – Shaido Apr 29 '20 at 10:02
  • Thank you for your help. I have solved this problem. – Neil Apr 29 '20 at 12:23

1 Answer

You can use a function with skiprows.

I edited your code below:

    import numpy as np
    import pandas as pd
    import csv, os

    # Loop through every file in the current working directory.
    for csvFilename in os.listdir('path'):
        if not csvFilename.endswith('.csv'):
            continue
        # Now let's read the dataframe
        total_line = len(open('path' + csvFilename).readlines())

        df = pd.read_csv('path' + csvFilename, skiprows=lambda x: x in list(range(total_line))[1:-1:5])

        new_file_name = csvFilename
        # And output
        df.to_csv('path' + new_file_name, index=False)
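
As a rough sketch of the same idea (not the code from this answer), the callable passed to `skiprows` can express "keep the header, keep every fifth line, and drop one particular line" directly, without counting the lines in the file first. pandas calls the function with each line index and skips the line when it returns `True`. The helper name `make_row_filter` and its parameters `step` and `drop_rows` are hypothetical, chosen only for illustration.

```python
def make_row_filter(step=5, drop_rows=(10,)):
    """Build a callable suitable for read_csv's skiprows argument.

    The returned function answers "should line i be skipped?".
    Line 0 (assumed to be the header) is always kept. After that,
    only every `step`-th line is kept, except lines in `drop_rows`.
    """
    drop = set(drop_rows)

    def skip(i):
        if i == 0:
            return False  # always keep the header line
        return i in drop or i % step != 0

    return skip


skip = make_row_filter(step=5, drop_rows=(10,))

# Check which line indices of a 21-line file would survive.
kept = [i for i in range(21) if not skip(i)]
print(kept)  # [0, 5, 15, 20]

# With pandas, the callable would be passed as:
#   df = pd.read_csv('path' + csvFilename, skiprows=skip)
```

This keeps the same lines as the `np.arange` / `np.delete` version in the question (lines 0, 5, 15, 20, ... with line 10 dropped), assuming the first line of each file is the header.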

Mo Huss
  • There is something wrong. If I do so, it would skip what I really want. – Neil Apr 29 '20 at 11:59
  • You can change the `[1:-1:5]` part of the code to either `[1:-1:6]` or `[1:-1:4]`, and you will get exactly what you want. – Mo Huss Apr 29 '20 at 14:13
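
To check which lines each of these slice variants would actually skip, one option is to print the slices for a small example. This is only an illustrative sketch: the line count of 21 is a made-up value, and the indices listed are the ones `skiprows` would skip, not the ones it would keep.

```python
total_line = 21  # hypothetical number of lines in a CSV file
rows = list(range(total_line))

# For each step, list the line indices that the lambda would skip.
skipped = {step: rows[1:-1:step] for step in (4, 5, 6)}

for step, indices in skipped.items():
    print(step, indices)
# 4 [1, 5, 9, 13, 17]
# 5 [1, 6, 11, 16]
# 6 [1, 7, 13, 19]
```

In every variant the slice names the lines that get removed, so all other lines are kept.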