
I have a file which I would like to read in Java and split into n (user input) output files. Here is how I read the file:

int n = 4;
BufferedReader br = new BufferedReader(new FileReader("file.csv"));
try {
    String line;
    while ((line = br.readLine()) != null) {
        // process line
    }
} finally {
    br.close();
}

How do I split the file - file.csv into n files?

Note - Since the number of entries in the file is of the order of 100k, I can't store the file contents in an array, split it, and then save it into multiple files.

Mifeet
Ankit Rustagi
  • in the while loop, just collect as many lines as you want into a String or StringBuilder and write them to separate files. You cannot know the number of files beforehand; it might be better to define a maximum number of lines per file. – John Smith Oct 04 '13 at 09:41
  • You either need to loop twice, once to get the number of lines and once to split. Or you could guess at the number of lines and split that way. – Boris the Spider Oct 04 '13 at 09:41
  • @kw4nta why on earth would you want to _store_ the lines. 1) the OP says that storing all the lines isn't an option, 2) given that you can write the lines straight to another file... – Boris the Spider Oct 04 '13 at 09:43
  • I suggest you to do a first pass where you count the number of lines. On a second pass, you divide it by `n` and create `n` files containing `total/n` lines. Use `BufferedReader.readLine()` for that purpose. – Arnaud Denoyelle Oct 04 '13 at 09:43
  • Another solution if it makes sense in this use case: use a round-robin algorithm (first line to the first file, second line to the second file, etc.) – Arnaud Denoyelle Oct 04 '13 at 09:44
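The round-robin suggestion from the last comment can be sketched like this (the `part-N.csv` output names and the tiny sample input are my own choices for the demo, not from the comment; only one line is held in memory at a time, which satisfies the 100k-entry constraint):

```java
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Arrays;

public class RoundRobinSplit {
    public static void split(Path src, int n) throws IOException {
        // one writer per output file
        BufferedWriter[] writers = new BufferedWriter[n];
        for (int i = 0; i < n; i++) {
            writers[i] = Files.newBufferedWriter(Paths.get("part-" + (i + 1) + ".csv"));
        }
        try (BufferedReader br = Files.newBufferedReader(src)) {
            String line;
            int lineNo = 0;
            while ((line = br.readLine()) != null) {
                BufferedWriter w = writers[lineNo++ % n]; // cycle through the n writers
                w.write(line);
                w.newLine();
            }
        } finally {
            for (BufferedWriter w : writers) {
                if (w != null) w.close();
            }
        }
    }

    public static void main(String[] args) throws IOException {
        // tiny sample input just to demonstrate; in the OP's case this is file.csv
        Path src = Paths.get("file.csv");
        Files.write(src, Arrays.asList("a,1", "b,2", "c,3", "d,4", "e,5"));
        split(src, 2);
        System.out.println(Files.readAllLines(Paths.get("part-1.csv"))); // lines 1, 3, 5
        System.out.println(Files.readAllLines(Paths.get("part-2.csv"))); // lines 2, 4
    }
}
```

Note that the line order across files is interleaved, so this only fits use cases where each line is an independent record.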

11 Answers


Since the source file can be very large, each split file could be large as well.

Example:

Source file size: 5GB

Number of splits: 5

Destination file size: 1GB each (5 files)

A chunk that large cannot be read in one go, even if we had that much memory. Instead, for each split we read a fixed-size byte array, which should be feasible in terms of both performance and memory.

NumSplits: 10, MaxReadBytes: 8KB

import java.io.BufferedOutputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.RandomAccessFile;

public class SplitFile {

    public static void main(String[] args) throws Exception {
        RandomAccessFile raf = new RandomAccessFile("test.csv", "r");
        long numSplits = 10; // from user input; extract it from args
        long sourceSize = raf.length();
        long bytesPerSplit = sourceSize / numSplits;
        long remainingBytes = sourceSize % numSplits;

        int maxReadBufferSize = 8 * 1024; // 8KB
        for (int destIx = 1; destIx <= numSplits; destIx++) {
            BufferedOutputStream bw = new BufferedOutputStream(new FileOutputStream("split." + destIx));
            if (bytesPerSplit > maxReadBufferSize) {
                long numReads = bytesPerSplit / maxReadBufferSize;
                long numRemainingRead = bytesPerSplit % maxReadBufferSize;
                for (int i = 0; i < numReads; i++) {
                    readWrite(raf, bw, maxReadBufferSize);
                }
                if (numRemainingRead > 0) {
                    readWrite(raf, bw, numRemainingRead);
                }
            } else {
                readWrite(raf, bw, bytesPerSplit);
            }
            bw.close();
        }
        if (remainingBytes > 0) {
            BufferedOutputStream bw = new BufferedOutputStream(new FileOutputStream("split." + (numSplits + 1)));
            readWrite(raf, bw, remainingBytes);
            bw.close();
        }
        raf.close();
    }

    static void readWrite(RandomAccessFile raf, BufferedOutputStream bw, long numBytes) throws IOException {
        byte[] buf = new byte[(int) numBytes];
        int val = raf.read(buf);
        if (val != -1) {
            bw.write(buf, 0, val); // write only the bytes actually read; read() may return fewer than requested
        }
    }
}
harsh
  • Well, it may split a line midway, and that matters for a CSV file – Pujan Dec 29 '15 at 04:12
  • Is there a way to overcome this? so that it doesn't split midline? – Julian Oct 11 '17 at 10:01
  • In my company we have a fixed record size for each column and we do padding in the CSV, so we divide the file size by one record size and then we split. Also, while reading, each line is sent on MQ to be inserted so that it is async. Anyway, your solution is good. – Kumar Abhishek Oct 17 '17 at 17:35
  • You can scan the end of the buffer to find the portion of the last line and add it to the next file. – Peter Lawrey Jul 04 '18 at 13:32
  • According to the `RandomAccessFile` documentation, `.read(` is not obligated to fill the entire destination buffer. Maybe this code should use `.readFully(` instead? – Jesbus Feb 07 '20 at 17:13
  • @harsh this is amazing! was able to split 2.2gb file into 10 different files! – Gaurav Jul 14 '20 at 13:30
  • Good one! But when splitting the file, it would be good to maintain the column/header as well. Any suggestions? – Ashok kumar Ganesan Jul 12 '22 at 21:21
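Following the comments above, one way to avoid cutting a CSV record in half is to extend each byte-counted chunk up to the next newline before starting the next part. A minimal sketch (the `aligned.N` output names and the sample input are my own, not from the answer; it reads byte-by-byte through a buffered stream, which is simple rather than fast):

```java
import java.io.BufferedInputStream;
import java.io.BufferedOutputStream;
import java.io.File;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Arrays;

public class NewlineAlignedSplit {
    // Writes chunks of roughly chunkSize bytes, extending each one to the
    // next '\n' so that no record is split across two output files.
    public static int split(File src, long chunkSize) throws IOException {
        int part = 0;
        try (BufferedInputStream in = new BufferedInputStream(new FileInputStream(src))) {
            int b = in.read();
            while (b != -1) {
                part++;
                try (BufferedOutputStream out = new BufferedOutputStream(
                        new FileOutputStream("aligned." + part))) {
                    long written = 0;
                    while (b != -1) {
                        out.write(b);
                        written++;
                        boolean atBoundary = written >= chunkSize && b == '\n';
                        b = in.read();
                        if (atBoundary) break; // chunk is full and ends on a line break
                    }
                }
            }
        }
        return part;
    }

    public static void main(String[] args) throws IOException {
        Path src = Paths.get("test-aligned.csv");
        Files.write(src, Arrays.asList("aaaa,1", "bbbb,2", "cccc,3", "dddd,4"));
        int parts = split(src.toFile(), 10); // ~10 bytes per chunk, rounded up to whole lines
        System.out.println(parts + " parts"); // prints "2 parts"
    }
}
```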
import java.io.*;
import java.util.Scanner;

public class Split {
    public static void main(String[] args) {
        try {
            String inputfile = "C:/test.txt"; // source file name
            int nol = 2000;                   // no. of lines to be saved in each output file

            // First pass: count the lines in the input file
            int count = 0;
            try (Scanner scanner = new Scanner(new File(inputfile))) {
                while (scanner.hasNextLine()) {
                    scanner.nextLine();
                    count++;
                }
            }
            System.out.println("Lines in the file: " + count);

            int nof = (count + nol - 1) / nol; // ceiling division
            System.out.println("No. of files to be generated: " + nof);

            // Second pass: write nol lines to each output file
            try (BufferedReader br = new BufferedReader(new FileReader(inputfile))) {
                for (int j = 1; j <= nof; j++) {
                    try (BufferedWriter out = new BufferedWriter(
                            new FileWriter("C:/New Folder/File" + j + ".txt"))) { // destination location
                        for (int i = 1; i <= nol; i++) {
                            String strLine = br.readLine();
                            if (strLine != null) {
                                out.write(strLine);
                                if (i != nol) {
                                    out.newLine();
                                }
                            }
                        }
                    }
                }
            }
        } catch (Exception e) {
            System.err.println("Error: " + e.getMessage());
        }
    }
}
user3556411
  • This doesn't do what the OP wanted (set number of files), but it does what I want (set number of lines). Good code! Modified it to be a function taking in a file name and dynamically naming created files. – Autumn Leonard Dec 04 '15 at 21:15
  • C&P from http://javaprogramming.language-tutorial.com/2012/10/split-huge-files-into-small-text-files.html ? (The blog entry is from 2012) – bish Sep 08 '16 at 09:53
  • Why should the number of lines be double? – Omid.N May 20 '20 at 22:02
  • one thing is very important that is scanner isn't thread safe – Muhammad_08 Nov 02 '22 at 11:51

Though it's an old question, for reference I am listing the code I used to split large files to any size; it works with any Java version above 1.4.

Sample split and join blocks are shown below:

public void join(String filePath) {
    int count = 1, data;
    try {
        OutputStream outfile = new BufferedOutputStream(new FileOutputStream(new File(filePath)));
        while (true) {
            File part = new File(filePath + count + ".sp");
            if (!part.exists()) {
                break;
            }
            InputStream infile = new BufferedInputStream(new FileInputStream(part));
            data = infile.read();
            while (data != -1) {
                outfile.write(data);
                data = infile.read();
            }
            infile.close();
            count++;
        }
        outfile.close();
    } catch (Exception e) {
        e.printStackTrace();
    }
}

public void split(String filePath, long splitlen) {
    long leng = 0;
    int count = 1, data;
    try {
        InputStream infile = new BufferedInputStream(new FileInputStream(new File(filePath)));
        data = infile.read();
        while (data != -1) {
            OutputStream outfile = new BufferedOutputStream(
                    new FileOutputStream(new File(filePath + count + ".sp")));
            while (data != -1 && leng < splitlen) {
                outfile.write(data);
                leng++;
                data = infile.read();
            }
            leng = 0;
            outfile.close();
            count++;
        }
        infile.close();
    } catch (Exception e) {
        e.printStackTrace();
    }
}

The complete Java code is available via the File Split in Java Program link.

a113nw
  • While this link may answer the question, it is better to include the essential parts of the answer here and provide the link for reference. Link-only answers can become invalid if the linked page changes. - [From Review](/review/low-quality-posts/12423371) – CubeJockey May 20 '16 at 19:41
  • Thanks, Updated the comment. – user1472187 May 20 '16 at 22:59

A compact solution that is easy to adapt.

Note: this solution loads the entire file into memory.

All lines of the file are read into List<String> rowsOfFile; adjust maxSizeFile to choose the maximum size of a single split file.

public void splitFile(File fileToSplit) throws IOException {
    long maxSizeFile = 10000000; // 10MB
    StringBuilder buffer = new StringBuilder((int) maxSizeFile);
    int sizeOfRows = 0;
    int recurrence = 0;
    String fileName;

    List<String> rowsOfFile =
            Files.readAllLines(fileToSplit.toPath(), Charset.defaultCharset());

    for (String row : rowsOfFile) {
        buffer.append(row).append(System.lineSeparator()); // keep the line breaks
        sizeOfRows += row.getBytes(StandardCharsets.UTF_8).length;
        if (sizeOfRows >= maxSizeFile) {
            fileName = generateFileName(recurrence);
            File newFile = new File(fileName);

            try (PrintWriter writer = new PrintWriter(newFile)) {
                writer.println(buffer.toString());
            }

            recurrence++;
            sizeOfRows = 0;
            buffer = new StringBuilder();
        }
    }
    // last rows
    if (sizeOfRows > 0) {
        fileName = generateFileName(recurrence);
        File newFile = new File(fileName);

        try (PrintWriter writer = new PrintWriter(newFile)) {
            writer.println(buffer.toString());
        }
    }
    Files.delete(fileToSplit.toPath());
}

Method to generate the file name:

public String generateFileName(int numFile) {
    String extension = ".txt";
    return "myFile" + numFile + extension;
}
Aymen

Keep a counter to count the number of entries, assuming one entry per line.

Step 1: Create a new sub-file and set counter = 0.

Step 2: Increment the counter as you read each entry from the source file into a buffer.

Step 3: When the counter reaches the limit on the number of entries per sub-file, flush the buffer's contents to the sub-file and close it.

Step 4: Jump back to step 1 until there is no more data to read from the source file.
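The steps above could be sketched like this (the limit of 2 entries per file, the `subN.txt` names, and the sample data are only for the demonstration):

```java
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Arrays;

public class CounterSplit {
    public static void main(String[] args) throws IOException {
        int limit = 2; // entries per sub-file
        Path src = Paths.get("entries.txt");
        Files.write(src, Arrays.asList("e1", "e2", "e3", "e4", "e5")); // sample data

        int fileNo = 0;
        try (BufferedReader reader = Files.newBufferedReader(src)) {
            String entry = reader.readLine();
            while (entry != null) {                       // step 4: repeat while data remains
                fileNo++;
                int counter = 0;                          // step 1: new sub-file, counter = 0
                try (BufferedWriter writer =
                        Files.newBufferedWriter(Paths.get("sub" + fileNo + ".txt"))) {
                    while (entry != null && counter < limit) {
                        writer.write(entry);              // step 2: write each entry, counting
                        writer.newLine();
                        counter++;
                        entry = reader.readLine();
                    }
                }                                         // step 3: flush and close the sub-file
            }
        }
        System.out.println(fileNo + " sub-files written"); // prints "3 sub-files written"
    }
}
```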

Pranalee

There's no need to loop through the file twice. You can estimate the size of each chunk as the source file size divided by the number of chunks needed, then simply stop filling each chunk with data once its size exceeds the estimate.
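A sketch of this single-pass approach, keeping chunk boundaries on line breaks (the `chunk-N` names, the sample data, and the rough one-byte-per-character size estimate are my own simplifications):

```java
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Arrays;

public class EstimatedSplit {
    // Single pass: estimate bytes per chunk from the source size, then start a
    // new chunk whenever the current one reaches the estimate.
    public static int split(Path src, int n) throws IOException {
        long estimate = Math.max(1, Files.size(src) / n);
        int part = 0;
        try (BufferedReader reader = Files.newBufferedReader(src)) {
            String line = reader.readLine();
            while (line != null) {
                part++;
                long written = 0;
                try (BufferedWriter writer =
                        Files.newBufferedWriter(Paths.get("chunk-" + part))) {
                    while (line != null && written < estimate) {
                        writer.write(line);
                        writer.newLine();
                        written += line.length() + 1; // rough byte count; assumes single-byte chars
                        line = reader.readLine();
                    }
                }
            }
        }
        return part;
    }

    public static void main(String[] args) throws IOException {
        Path src = Paths.get("data.txt");
        Files.write(src, Arrays.asList("line1", "line2", "line3", "line4")); // sample data
        System.out.println(split(src, 2) + " chunks"); // prints "2 chunks"
    }
}
```

Because the estimate is approximate, chunk sizes may vary slightly, but no second pass over the file is needed.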

Leff

Here is one that worked for me; I used it to split a 10GB file. It also lets you add a header and a footer, which is very useful when splitting document-based formats such as XML and JSON, because the new split files need a document wrapper.

import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.File;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.StandardOpenOption;

public class FileSpliter
{
    public static void main(String[] args) throws IOException
    {
        splitTextFiles("D:\\xref.csx", 750000, "", "", null);
    }

    public static void splitTextFiles(String fileName, int maxRows, String header, String footer, String targetDir) throws IOException
    {
        File bigFile = new File(fileName);
        int i = 1;
        String ext = fileName.substring(fileName.lastIndexOf("."));

        String fileNoExt = bigFile.getName().replace(ext, "");
        File newDir = null;
        if(targetDir != null)
        {
            newDir = new File(targetDir);           
        }
        else
        {
            newDir = new File(bigFile.getParent() + "\\" + fileNoExt + "_split");
        }
        newDir.mkdirs();
        try (BufferedReader reader = Files.newBufferedReader(Paths.get(fileName)))
        {
            String line = null;
            int lineNum = 1;
            Path splitFile = Paths.get(newDir.getPath() + "\\" +  fileNoExt + "_" + String.format("%02d", i) + ext);
            BufferedWriter writer = Files.newBufferedWriter(splitFile, StandardOpenOption.CREATE);
            while ((line = reader.readLine()) != null)
            {
                if(lineNum == 1)
                {
                    System.out.print("new file created '" + splitFile.toString());
                    if(header != null && header.length() > 0)
                    {
                        writer.append(header);
                        writer.newLine();
                    }
                }
                writer.append(line);

                if (lineNum >= maxRows)
                {
                    if(footer != null && footer.length() > 0)
                    {
                        writer.newLine();
                        writer.append(footer);
                    }
                    writer.close();
                    System.out.println(", " + lineNum + " lines written to file");
                    lineNum = 1;
                    i++;
                    splitFile = Paths.get(newDir.getPath() + "\\" + fileNoExt + "_" + String.format("%02d", i) + ext);
                    writer = Files.newBufferedWriter(splitFile, StandardOpenOption.CREATE);
                }
                else
                {
                    writer.newLine();
                    lineNum++;
                }
            }
            if(lineNum <= maxRows) // early exit
            {
                if(footer != null && footer.length() > 0)
                {
                    writer.newLine();
                    lineNum++;
                    writer.append(footer);
                }
            }
            writer.close();
            System.out.println(", " + lineNum + " lines written to file");
        }

        System.out.println("file '" + bigFile.getName() + "' split into " + i + " files");
    }
}
amralieg

The code below splits a big file into smaller files with fewer lines each.

    long linesWritten = 0;
    int count = 1;

    try {
        File inputFile = new File(inputFilePath);
        InputStream inputFileStream = new BufferedInputStream(new FileInputStream(inputFile));
        BufferedReader reader = new BufferedReader(new InputStreamReader(inputFileStream));

        String line = reader.readLine();

        String fileName = inputFile.getName();
        String outfileName = outputFolderPath + "\\" + fileName;

        while (line != null) {
            File outFile = new File(outfileName + "_" + count + ".split");
            Writer writer = new OutputStreamWriter(new FileOutputStream(outFile));

            while (line != null && linesWritten < linesPerSplit) {
                writer.write(line);
                writer.write(System.lineSeparator()); // keep the line breaks in the output
                line = reader.readLine();
                linesWritten++;
            }

            writer.close();
            linesWritten = 0; // next file
            count++;          // next file count
        }

        reader.close();

    } catch (Exception e) {
        e.printStackTrace();
    }
  • The code I have written above is working and I have tested it on a file with 40L records/lines. It takes around 10 secs to split the file into chunks of 1L lines per file. – Narendra Kumar Samal Nov 21 '17 at 06:22
  • The code above is missing re-adding the line separator. `writer.write(System.lineSeparator());` is needed, otherwise it is 1 huge line. – rveach Aug 02 '18 at 18:44

Split a file into multiple chunks (an in-memory operation). Here I'm splitting any file into chunks of 500KB (500,000 bytes):

public static List<ByteArrayOutputStream> splitFile(File f) {
    List<ByteArrayOutputStream> datalist = new ArrayList<>();
    try {
        int sizeOfFiles = 500000;
        byte[] buffer = new byte[sizeOfFiles];

        try (FileInputStream fis = new FileInputStream(f);
             BufferedInputStream bis = new BufferedInputStream(fis)) {

            int bytesAmount = 0;
            while ((bytesAmount = bis.read(buffer)) > 0) {
                try (OutputStream out = new ByteArrayOutputStream()) {
                    out.write(buffer, 0, bytesAmount);
                    out.flush();
                    datalist.add((ByteArrayOutputStream) out);
                }
            }
        }
    } catch (Exception e) {
        // get the error
    }

    return datalist;
}
Ajith

I am a bit late to answer, but here's how I did it.

Approach:

First I determine how many bytes each of the individual files should contain, then I split the large file by bytes. Only one file chunk's worth of data is loaded into memory at a time.

Example: if a 5GB file is split into 10 files, then only 500MB worth of bytes are loaded into memory at a time, held in the buffer variable in the splitBySize method below.

Code explanation:

The method splitFile first gets the number of bytes each individual file chunk should contain by calling the getSizeInBytes method. It then calls the splitBySize method, which splits the large file by size (i.e. maxChunkSize is the number of bytes each file chunk will contain).

public static List<File> splitFile(File largeFile, int noOfFiles) throws IOException {
    return splitBySize(largeFile, getSizeInBytes(largeFile.length(), noOfFiles));
}

public static List<File> splitBySize(File largeFile, int maxChunkSize) throws IOException {
    List<File> list = new ArrayList<>();
    int numberOfFiles = 0;
    try (InputStream in = Files.newInputStream(largeFile.toPath())) {
        final byte[] buffer = new byte[maxChunkSize];
        int dataRead = in.read(buffer);
        while (dataRead > -1) {
            list.add(stageLocally(buffer, dataRead));
            numberOfFiles++;
            dataRead = in.read(buffer);
        }
    }
    System.out.println("Number of files generated: " + numberOfFiles);
    return list;
}

private static int getSizeInBytes(long totalBytes, int numberOfFiles) {
    if (totalBytes % numberOfFiles != 0) {
        totalBytes = ((totalBytes / numberOfFiles) + 1)*numberOfFiles;
    }
    long x = totalBytes / numberOfFiles;
    if (x > Integer.MAX_VALUE){
        throw new NumberFormatException("Byte chunk too large");

    }
    return (int) x;
}

Full Code:

public class StackOverflow {

private static final String INPUT_FILE_PATH = "/Users/malkesingh/Downloads/5MB.zip";
private static final String TEMP_DIRECTORY = "/Users/malkesingh/temp";

public static void main(String[] args) throws IOException {

    File input = new File(INPUT_FILE_PATH);
    File outPut = fileJoin2(splitFile(input, 5));

    try (InputStream in = Files.newInputStream(input.toPath()); InputStream out = Files.newInputStream(outPut.toPath())) {
        System.out.println(IOUtils.contentEquals(in, out));
    }

}

public static List<File> splitFile(File largeFile, int noOfFiles) throws IOException {
    return splitBySize(largeFile, getSizeInBytes(largeFile.length(), noOfFiles));
}

public static List<File> splitBySize(File largeFile, int maxChunkSize) throws IOException {
    List<File> list = new ArrayList<>();
    int numberOfFiles = 0;
    try (InputStream in = Files.newInputStream(largeFile.toPath())) {
        final byte[] buffer = new byte[maxChunkSize];
        int dataRead = in.read(buffer);
        while (dataRead > -1) {
            list.add(stageLocally(buffer, dataRead));
            numberOfFiles++;
            dataRead = in.read(buffer);
        }
    }
    System.out.println("Number of files generated: " + numberOfFiles);
    return list;
}

private static int getSizeInBytes(long totalBytes, int numberOfFiles) {
    if (totalBytes % numberOfFiles != 0) {
        totalBytes = ((totalBytes / numberOfFiles) + 1)*numberOfFiles;
    }
    long x = totalBytes / numberOfFiles;
    if (x > Integer.MAX_VALUE){
        throw new NumberFormatException("Byte chunk too large");

    }
    return (int) x;
}
private static File stageLocally(byte[] buffer, int length) throws IOException {
    File outPutFile = File.createTempFile("temp-", "split", new File(TEMP_DIRECTORY));
    try(FileOutputStream fos = new FileOutputStream(outPutFile)) {
        fos.write(buffer, 0, length);
    }
    return outPutFile;
}


public static File fileJoin2(List<File> list) throws IOException {
    File outPutFile = File.createTempFile("temp-", "unsplit", new File(TEMP_DIRECTORY));
    FileOutputStream fos = new FileOutputStream(outPutFile);
    for (File file : list) {
        Files.copy(file.toPath(), fos);
    }
}}
Malkeith Singh
import java.util.*;
import java.io.*;

public class task13 {
    public static void main(String[] args) throws IOException {
        Scanner s = new Scanner(System.in);
        System.out.print("Enter path:");
        String a = s.next();
        File f = new File(a + ".txt");
        System.out.println(f.canRead() + "\n" + f.canWrite());
        long l = f.length();
        System.out.println("Length is: " + l);
        System.out.print("Enter no. of partitions:");
        int p = s.nextInt();
        int x = (int) (l / p);
        System.out.println("Each File Length is: " + x);

        // Read the whole file into a single string (note: this loads the
        // entire file into memory, so it only suits smaller files)
        Scanner st = new Scanner(f);
        st.useDelimiter("\\Z");
        String t = st.next();
        st.close();

        for (int i = 1; i <= p; i++) {
            int g = (i - 1) * x;
            // the last partition also takes any remaining characters
            int h = (i == p) ? t.length() : i * x;
            try (FileWriter fw = new FileWriter(a + "-" + i + ".txt")) {
                fw.write(t.substring(g, h));
            }
        }
    }
}
K.TEJ