
I am using the source code from How to split a huge csv file based on content of first column, in order to split a file into multiple binary files based on the value of its first column.


Input file:

1 v1 v2 v3
1 v1 v2 v3
1 v1 v2 v3
2 v1 v2 v3
2 v1 v2 v3
2 v1 v2 v3

Output files:

1.bin
1 v1 v2 v3
1 v1 v2 v3
1 v1 v2 v3
2.bin
2 v1 v2 v3
2 v1 v2 v3
2 v1 v2 v3
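
For reference, the grouping step by itself behaves like this on the sample input (a minimal sketch; the file name input.txt is just a placeholder):

    import csv
    from itertools import groupby

    # Iterate over groups of consecutive rows that share the same first column.
    # "input.txt" stands in for the space-delimited file shown above.
    with open("input.txt") as f:
        for key, rows in groupby(csv.reader(f, delimiter=' '), lambda row: row[0]):
            print("group %s has %d rows" % (key, sum(1 for _ in rows)))

With the six sample rows this prints "group 1 has 3 rows" and "group 2 has 3 rows".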

I added a condition so that a group is only written out if it contains more than 2 rows.

#!/usr/bin/python
# -*- coding: utf-8 -*-
import sys
import csv
from itertools import groupby


def split_file(file, path):
    for key, rows in groupby(csv.reader(open(file), delimiter=' '),
                             lambda row: row[0]):
        length = len(list(rows))
        if length > 2:
            with open(path + "%s.bin" % key, "wb+") as output:
                for row in rows:
                    l = len(row) - 1
                    print str(l) + " " + " ".join(row[1:]) + "\n"
                    output.write(str(l) + " " + " ".join(row[1:]) + "\n")


if __name__ == "__main__":
    # sys.argv[1]: tf-idf file to split; sys.argv[2]: output path for the .bin files
    split_file(sys.argv[1], sys.argv[2])

The problem is that when I add `length = len(list(rows))` and the condition `if length > 2:`, nothing is written anymore. I really don't get it!


1 Answer


`rows` is a generator. By passing it to `list()` you've exhausted the generator, so you cannot loop over it again.
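
To see the exhaustion concretely, here is a minimal sketch (the data is illustrative, not taken from the question):

    from itertools import groupby

    data = [["1", "a"], ["1", "b"], ["2", "c"]]
    for key, rows in groupby(data, lambda row: row[0]):
        length = len(list(rows))  # consumes the group iterator
        print(list(rows))         # the iterator is now exhausted: prints []

Both groups print an empty list, which is exactly why nothing ends up in the output files.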

Convert rows to a list first, then take the length of that separately:

rows = list(rows)
length = len(rows)
if length > 2:

or simply test the length:

rows = list(rows)
if len(rows) > 2:
Martijn Pieters
  • I wrongly thought that it wouldn't change. Thank you. Martijn, am I writing to a binary file correctly? – Hani Goc Oct 01 '15 at 11:00
  • @HaniGoc: you don't need the `+` in the mode unless you are also planning to read from the file again at the same time. Personally, I'd use a `csv.writer()` object to produce the output instead. – Martijn Pieters Oct 01 '15 at 11:26
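
Putting the fix and the comment's suggestion together, the question's function could look like the sketch below. This is an illustration rather than code from the answer; it keeps the question's Python 2 style and space-delimited format, and sets `lineterminator='\n'` to match the original `output.write()` calls:

    import csv
    import sys
    from itertools import groupby


    def split_file(file, path):
        with open(file) as f:
            for key, rows in groupby(csv.reader(f, delimiter=' '),
                                     lambda row: row[0]):
                rows = list(rows)  # materialise the group once, reuse it freely
                if len(rows) > 2:
                    # "wb" is fine on Python 2; on Python 3 you would use
                    # open(..., "w", newline="") for csv.writer instead.
                    with open(path + "%s.bin" % key, "wb") as output:
                        writer = csv.writer(output, delimiter=' ',
                                            lineterminator='\n')
                        for row in rows:
                            # Prepend the field count, as the original code did.
                            writer.writerow([len(row) - 1] + row[1:])


    if __name__ == "__main__":
        split_file(sys.argv[1], sys.argv[2])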