I am using the code from "How to split a huge csv file based on content of first column" to split a file into multiple binary files based on the value of the first column.
Input file:
1 v1 v2 v3
1 v1 v2 v3
1 v1 v2 v3
2 v1 v2 v3
2 v1 v2 v3
2 v1 v2 v3
Expected output:
1.bin
1 v1 v2 v3
1 v1 v2 v3
1 v1 v2 v3
2.bin
2 v1 v2 v3
2 v1 v2 v3
2 v1 v2 v3
I added a condition so that a group is only written if it has more than 2 rows:
#! /usr/bin/python
# -*- coding: utf-8 -*-
import re, sys
import xml.etree.ElementTree as ET
import os
import csv
from itertools import groupby
def split_file(file, path):
    for key, rows in groupby(csv.reader(open(file), delimiter=' '), lambda row: row[0]):
        length = len(list(rows))
        if(length > 2):
            with open(path + "%s.bin" % key, "wb+") as output:
                for row in rows:
                    l = len(row) - 1
                    print str(l)+" " + " ".join(row[1:]) + "\n"
                    output.write(str(l)+" " + " ".join(row[1:]) + "\n")
if __name__ == "__main__":
    #tf-idf file to split_file
    #path of binary files
    split_file(sys.argv[1], sys.argv[2])
The problem is that as soon as I add length = len(list(rows)) and the condition if(length > 2):, nothing is written to the output files anymore. I really don't get it!
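A possible explanation, offered as a sketch rather than a confirmed diagnosis: each group yielded by itertools.groupby is a one-shot iterator, so calling list(rows) to measure its length consumes it, and a later `for row in rows` loop has nothing left to iterate. The example below reproduces that behaviour on a small in-memory list and shows one way around it (materializing the group once and reusing the list). The function names `group_sizes_buggy` and `groups_kept`, the `min_rows` parameter, and the sample data are hypothetical and exist only for illustration.

```python
from itertools import groupby

# Hypothetical sample data standing in for the parsed CSV rows.
sample = [["1", "a"], ["1", "b"], ["1", "c"], ["2", "a"]]

def group_sizes_buggy(data):
    """Measure each group, then iterate it again: the second pass is empty."""
    result = {}
    for key, group in groupby(data, lambda row: row[0]):
        length = len(list(group))          # consumes the group iterator
        leftover = [row for row in group]  # iterator is exhausted by now
        result[key] = (length, leftover)
    return result

def groups_kept(data, min_rows):
    """Materialize each group once, then reuse the list for both steps."""
    result = {}
    for key, group in groupby(data, lambda row: row[0]):
        group = list(group)                # keep the rows around
        if len(group) > min_rows:
            result[key] = group
    return result

print(group_sizes_buggy(sample))
print(groups_kept(sample, 2))
```

Applied to the code in the question, this would mean assigning rows = list(rows) once at the top of the loop and using len(rows) in the condition, so the write loop still sees every row.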