
I have a huge data frame with column names:

A,B,C,D,F,G,H,GT_a,N_a_,E_a,GT_b,N_b_,E_b,GT_c,N_c_,E_c,...,GT_n,N_n,E_n

Using unix/bash or python, I want to produce n individual files with the following columns:

A,B,C,D,F,G,H,GT_a,N_a_,E_a

A,B,C,D,F,G,H,GT_b,N_b_,E_b

A,B,C,D,F,G,H,GT_c,N_c_,E_c

....

A,B,C,D,F,G,H,GT_n,N_n_,E_n

Each file should be called: a.txt, b.txt, c.txt,...,n.txt

Lucas

3 Answers

import pandas as pd
import numpy as np

# Build a small demo DataFrame with the described column layout
c = "A,B,C,D,F,G,H,GT_a,N_a_,E_a,GT_b,N_b_,E_b,GT_c,N_c_,E_c,GT_d,N_d_,E_d,GT_e,N_e_,E_e".split(',')
df = pd.DataFrame(np.full((30, len(c)), c), columns=c)

# Pull the column names from the DataFrame itself
c = list(df.columns)
default = c[:7]                       # the shared A..H columns
var = np.array(c[7:])                 # np.matrix is deprecated; use a plain array
var = pd.DataFrame(var.reshape(-1, 3))  # one row per GT/N/E triple

def dump(row):
    cols = default + list(row)
    magic = cols[-1][-1]              # last letter of the E_<x> column name
    df[cols].to_csv(magic + '.txt', index=False)

var.apply(dump, axis=1)
volante
  • Thanks very much @volante ! But how do you do it when you have >1,000,000 columns and you want to create the vector c with the column names? – Lucas Aug 09 '20 at 20:11
  • Hi @volante, I mean, if you have >1,000,000 columns and you are reading a real dataframe instead as a string vector with the column names. Thanks very much – Lucas Aug 09 '20 at 21:14
  • @Lucas Answer updated to pull column names from the DataFrame. Is that what you meant? – volante Aug 10 '20 at 03:03
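Following up on the comment thread: when the suffixes are arbitrary rs-IDs rather than single letters, the suffixes themselves can also be pulled from the header. A sketch along the same lines — the rs-IDs and the `N_<id>` naming (without a trailing underscore) are assumptions taken from the comments, not from the original question:

```python
import pandas as pd
import numpy as np

# Demo frame with rs-ID style suffixes (stand-ins for the real data)
cols = ("A,B,C,D,F,G,H,"
        "GT_rs1637537,N_rs1637537,E_rs1637537,"
        "GT_rs9487596,N_rs9487596,E_rs9487596").split(',')
df = pd.DataFrame(np.full((5, len(cols)), cols), columns=cols)

default = list(df.columns[:7])
# every 3rd column after the first 7 is GT_<id>; split once on '_' to get <id>
suffixes = [c.split('_', 1)[1] for c in df.columns[7::3]]

for s in suffixes:
    out_cols = default + [f"GT_{s}", f"N_{s}", f"E_{s}"]
    df[out_cols].to_csv(f"{s}.txt", index=False)
```

Each output file is then named after its id (rs1637537.txt, rs9487596.txt, ...) and carries the seven shared columns plus its own triple.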

This should write out the different files, with a different header for each file. You'll have to change COL_NAMES_TO_WRITE to the columns that you want.

It uses only the standard library, so no pandas. As written it won't produce more than 26 different files, but the filename generator could be changed to allow more.

If I'm interpreting this question correctly, you want to split this into 14 files (a..n).

Copy the code below into a file, splitter.py, and then run: python3 splitter.py -fn largefile.txt -n 14

Where largefile.txt is the huge file that you need to split.

import argparse
import csv
import string

COL_NAMES_TO_WRITE = "A,B,C,D,F,G,H,GT_{letter},N_{letter},E_{letter}"
WRITTEN_HEADERS = set()  # track which output files already have a header row

def output_file_generator(num):
    """Yield a.txt, b.txt, ... cycling through `num` names."""
    if num > 26:
        raise ValueError(f"Can only print out 26 different files, not {num}")
    i = 0
    while True:
        prefix = string.ascii_lowercase[i]
        i = (i + 1) % num
        yield f"{prefix}.txt"

def col_name_generator(num):
    """Yield the column list for each output file, cycling through `num` lists."""
    i = 0
    while True:
        col_suffix = string.ascii_lowercase[i]
        i = (i + 1) % num
        yield COL_NAMES_TO_WRITE.format(letter=col_suffix).split(',')

def main(filename, num_files=4):
    """Split a file into multiple files.

    Args:
        filename (str): large filename that needs to be split into multiple files
        num_files (int): number of files to split filename into
    """
    output_files = output_file_generator(num_files)
    col_names = col_name_generator(num_files)
    # materialize one (filename, columns) pair per output file
    targets = [(next(output_files), next(col_names)) for _ in range(num_files)]

    with open(filename, 'r') as large_file_fp:
        reader = csv.DictReader(large_file_fp)
        for line in reader:
            # every input row contributes to every output file
            for out_name, out_cols in targets:
                with open(out_name, 'a') as output_fp:
                    writer = csv.DictWriter(output_fp, fieldnames=out_cols)
                    if out_name not in WRITTEN_HEADERS:
                        writer.writeheader()
                        WRITTEN_HEADERS.add(out_name)
                    just_these_fields = {k: v for k, v in line.items() if k in out_cols}
                    writer.writerow(just_these_fields)


if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument("-fn", "--filename", required=True,
                        help="filename of large file to be split")
    parser.add_argument("-n", "--num_files", type=int, default=4,
                        help="number of separate files to split large_file into")
    args = parser.parse_args()
    main(args.filename, args.num_files)
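As noted above, the filename generator stops at 26 files. One way to lift that limit is spreadsheet-style names (a..z, then aa, ab, ...); a sketch of a possible drop-in replacement:

```python
import itertools
import string

def filename_generator():
    """Yield a.txt .. z.txt, then aa.txt, ab.txt, ... indefinitely."""
    for size in itertools.count(1):
        for letters in itertools.product(string.ascii_lowercase, repeat=size):
            yield "".join(letters) + ".txt"

gen = filename_generator()
names = [next(gen) for _ in range(28)]
# names runs a.txt .. z.txt, then aa.txt, ab.txt, ...
```

Because the sequence is unbounded, the `num > 26` guard is no longer needed, though the cycling logic would need the same materialize-then-loop treatment as above.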
voortuck
  • Thanks @voortuck !. When I run your code I get: "File "split_df_into_snp_files.py", line 12 if num > 26: raise ValueError(f"Can only print out 26 different files, not {num}") ^ SyntaxError: invalid syntax" . In reality I'd like to split the big file into 893 files (I replaced 26 by 893). After the 7th column, all columns in the big file are called like this: GT_rs1637537, N_rs1637537, E_rs1637537, GT_rs9487596, N_rs9487596, E_rs9487596, and so on. So at the end I'd get 953 files with 10 columns each. – Lucas Aug 10 '20 at 21:29
  • Hi @voortuck, I run it again (python3 splitter.py -fn myfile.txt -n 893) and I get: Traceback (most recent call last): File "splitter.py", line 59, in main(args.filename, int(args.num_files)) File "splitter.py", line 37, in main for line in reader: File "/usr/lib/python3.6/csv.py", line 111, in next self.fieldnames File "/usr/lib/python3.6/csv.py", line 98, in fieldnames self._fieldnames = next(self.reader) _csv.Error: field larger than field limit (131072) – Lucas Aug 10 '20 at 21:49
  • It looks like you have more than 131072 columns which exceeds the csv's libraries abilities. You might have to add: `csv.field_size_limit(sys.maxsize)` as mentioned [here](https://stackoverflow.com/questions/15063936/csv-error-field-larger-than-field-limit-131072) – voortuck Aug 11 '20 at 21:06
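The workaround from the last comment — raising csv's per-field limit before constructing the reader — can be sketched as follows; the halving loop guards against the OverflowError that `sys.maxsize` can trigger on some platforms:

```python
import csv
import sys

# Raise the csv module's per-field limit (default 131072 bytes) before
# creating the DictReader; back off by halving if the platform's C long
# cannot hold sys.maxsize.
limit = sys.maxsize
while True:
    try:
        csv.field_size_limit(limit)
        break
    except OverflowError:
        limit //= 2
```

Placed near the top of splitter.py, this would let DictReader handle the very wide header row.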

Here are a couple of solutions with bash tools.


1. bash

Using cut inside a bash loop. This launches n processes and parses the file n times.

Update for the case where the _id parts of the column names are not just a sequence of letters but arbitrary string ids, repeating every 3 columns after the first 7. We first read the file's header and extract them; e.g. a quick solution is to use awk to print every 8th, 11th, etc. column name into a bash array.

#!/bin/bash
first=7
#ids=( {a..n} )
ids=( $( head -1 "$1" | awk -F"_" -v RS="," -v f="$first" 'NR>f && (NR+1)%3==0{print $2}' ) )

for i in "${!ids[@]}"; do
    cols="1-$first,$((first+1+3*i)),$((first+2+3*i)),$((first+3+3*i))"
    cut -d, -f"$cols" "$1" > "${ids[i]}.txt"
done

Usage: bash test.sh file


2. awk

Or you can use awk. Here I hard-code just the number of outputs, but the ids could also be extracted as in the first solution.

BEGIN { FS=OFS=","; times=14 } 
{ 
  for (i=1;i<=times;i++) {
    print $1,$2,$3,$4,$5,$6,$7,$(5+3*i),$(6+3*i),$(7+3*i) > sprintf("%c.txt",i+96)
  }
}

Usage: awk -f test.awk file.

This solution should be fast, as it parses the file once. But for a large number of output files it could throw a "too many open files" error; for the range of the letters here, it should be ok.
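If more output files than the open-descriptor limit are ever needed, the one-pass awk approach can still work by close()-ing each file after writing. Note the switch to >> appending, since > would truncate each time the file is reopened. A minimal sketch with an invented two-group toy input:

```shell
# toy input: 7 shared columns plus two GT/N/E groups (a and b)
rm -f a.txt b.txt
printf 'A,B,C,D,F,G,H,GT_a,N_a_,E_a,GT_b,N_b_,E_b\n1,2,3,4,5,6,7,ga,na,ea,gb,nb,eb\n' > file.csv

awk 'BEGIN { FS=OFS=","; times=2 }
{
  for (i=1; i<=times; i++) {
    out = sprintf("%c.txt", i+96)          # 97 is ASCII "a"
    print $1,$2,$3,$4,$5,$6,$7,$(5+3*i),$(6+3*i),$(7+3*i) >> out
    close(out)                             # release the descriptor immediately
  }
}' file.csv
```

This trades some speed (each file is reopened per row) for never holding more than one output descriptor at a time.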

thanasisp
  • Thanks @thanasisp !. But my file has >10000 columns, and the column names given are just toy examples. In reality column names are of the form: GT_rs1637537, N_rs1637537, E_rs1637537, GT_rs9487596, N_rs9487596, E_rs9487596, etc. Thanks very much for your help – Lucas Aug 10 '20 at 12:24
  • You are right @thanasisp, and your solution is right. Would you be so kind as to indicate me how can I adjust the solution to column names of the form GT_rs1637537, GT_rs163753788, etc?. Thanks very much and apologize for the inconvenient. – Lucas Aug 10 '20 at 13:53
  • Thanks @thanasisp ! For some reason I don't understand, when I apply your .sh script, I get: "head: cannot open 'file' for reading: No such file or directory", even though I give the whole path: bash test.sh /path/myfile.txt – Lucas Aug 10 '20 at 20:02
  • Thanks @thanasisp. When I run you bash script with my file I do not get any output file – Lucas Aug 10 '20 at 21:36
  • @Lucas I tested it with your sample input (without the dots "...") and I see 4 files with the headers created as expected. – thanasisp Aug 10 '20 at 21:45