1

I have a CSV file that looks as follows:

ex1

This is an reduced example. My original CSV contains thousands of rows. Now I my aim is to cut this CSV into different smaller pieces. However, I want to have the header in each file and the header consists of the first 8 rows, the 9th contains a blank line and this also belongs to the header.

As I am not interested in the data and just cutting it, I tried to find a solution which is fast and does not involve any data modification. I do not want to use Pandas. I would like to avoid reading and writing with csv module, but I also tried to use the implementation mentioned in an answer to Splitting one csv into multiple files. Problem here is that my header consists not of just one row, but 9 rows in total. And I would like to avoid reading and writing data, as I actually just need to cut it.

I tried to use islice.

The first CSV can be created as follows:

from itertools import islice

with open(r"C:\Desktop\myfolder\mycsv.csv") as f, open (r"C:\Desktop\myfolder\out.csv", "w") as out:
    r = islice(f, 0, 12)
    out.writelines(r)

That gives my first CSV, containing 12 rows (I think), including the header. Now I do not know how to cut my second file in such way that it contains data from the 13th row, until lets say 20th row, but including the header. I am not sure if this can be done with islice. I know how cut to get the data with setting start and stop islice(f, 13 20), but then the header is missing. I also thought about cutting one file, the header and then the data file and attaching them. But I couldn't find a way to attach two CSVs, without having to read and write or make use of pandas.

BertHobe
  • 217
  • 1
  • 14
  • is your header `Column1, ..., Column4`? – Ricardo Sanchez Apr 21 '21 at 06:56
  • Have you looked at command line solutions like [this](https://stackoverflow.com/questions/51420966/split-csv-files-into-smaller-files-but-keeping-the-headers)? You could invoke this from your python code or jupyter notebook as well. – petezurich Apr 21 '21 at 07:38
  • The format you describe isn't standard CSV format, so would be problematic to handle with the `csv` module or by `pandas` anyway. You need to open the `out` file as mode `"w"` for writing to it (the default is `"r"` for reading). Don't post images of textual data — no one can use them for testing their answers. Also, be careful using existing answers to CSV file questions here because many of them are for Python 2, not 3, and there are significant differences between the two wrt CSV file handling. – martineau Apr 21 '21 at 07:44
  • @RicardoSanchez As I wrote my header consists of 9 rows, so not only Column1, ...., Column4. – BertHobe Apr 21 '21 at 07:59
  • @martineau I added the "w". The example should illustrate how my data looks. And it looks like this. It is a csv file. It is separated with "\t". Not semicolon. – BertHobe Apr 21 '21 at 07:59
  • @petezurich Thanks for the link. No, I haven't tried, nor do I know how it works. – BertHobe Apr 21 '21 at 08:00
  • 1
    If the basic file format are rows of values separated with a single character, you _can_ use many of Python's `csv` module's function as long as you avoid those that deal with header rows — so I suggest you do so. Basically it sounds like all you need to do is read the first **N** rows of the file the comprise its "header", and then write them at the beginning of each of your output files. If you had some usable sample data in your question, someone might actually post an answer with some code in it… – martineau Apr 21 '21 at 08:06

1 Answers1

1

I would try something like this:

  • read the file line by line
  • remember the header lines
  • start a new output file whenever needed, and prepend the header lines every time

Code:

a.csv:

c1 c2 
c3 c4

d1 d2
d3 d4
d5 d6
d7 d8

script.py:

with open("a.csv", "r") as inp:
    header = []
    headersize = 3
    for _ in range(0, headersize):
        header.append(inp.readline())

    filenumber = 1
    chunksize = 2
    dataline = inp.readline()
    while dataline != "":
        with open("a_" + str(filenumber) + ".csv", "w") as outp:
            for headerline in header:
                outp.write(headerline)
            for _ in range(0, chunksize):
                outp.write(dataline)
                dataline = inp.readline()
        filenumber += 1

a_1.csv:

c1 c2 
c3 c4

d1 d2
d3 d4

a_2.csv:

c1 c2 
c3 c4

d5 d6
d7 d8

Instead of using a fixed chunksize, you can implement any other condition for starting a new file.

maij
  • 4,094
  • 2
  • 12
  • 28