
Hi, I am new to Spring Batch. I want to create multiple files (CSV) per chunk processed. The file name will be something like timestamp.csv. Any idea how I can do that? Basically, it is splitting one big file into smaller files.

Thank you!

PAA
Lp. Don
  • So you wish to split big files into smaller one before kicking in chunk ( read -> process -> write ) logic & one small file be input to step's chunk processor ? – Sabir Khan Nov 26 '19 at 14:09
  • Hi @SabirKhan, 1 file (xlsx actually) containing roughly 600k-800k records. Since it is too big of a file to process, i have to split it first into csv file containing 100k records (chunk). – Lp. Don Nov 28 '19 at 01:59
  • So this part for which question was asked ( file splitting ) looks more like a **job preprocessing** to me than actual job logic so you can very well write a custom splitter logic in [JobExecutionListenerSupport.beforeJob](https://docs.spring.io/spring-batch/docs/current/api/org/springframework/batch/core/listener/JobExecutionListenerSupport.html#beforeJob-org.springframework.batch.core.JobExecution-) & then set up actual job on all files for directory of splitted files. For efficient execution - partitioning seems a use case here. – Sabir Khan Nov 28 '19 at 04:06

3 Answers


CSV files are basically text files with a newline character at the end of each record.

So as far as splitting a big CSV file into smaller files is concerned, you simply need to read the big file line by line in Java, and when your read line count reaches the threshold / max count per small file (10, 100, 1000, etc.), you create a new file with a naming convention as per your need and dump the data there.

How to read a large text file line by line using Java?

BufferedReader is the main class to read a text file line by line.

Implementing this logic has nothing to do with Spring Batch; it can be done in plain Java or using OS-level commands.

So you have two distinct logical pieces: reading the big file line by line, and creating the CSV files. You can develop these two pieces as separate components and plug them into the Spring Batch framework at the appropriate place, as per your business requirements.
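To make the idea concrete, here is a minimal sketch of the splitting piece using `BufferedReader`. The class and method names (`FileSplitter.split`, the `part-N.csv` naming) are illustrative choices, not anything prescribed by the question:

```java
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;

// Illustrative splitter: reads the big file line by line and starts a new
// output file every time maxLinesPerFile lines have been written.
public class FileSplitter {

    public static List<Path> split(Path bigFile, Path outputDir, int maxLinesPerFile) throws IOException {
        List<Path> parts = new ArrayList<>();
        Files.createDirectories(outputDir);
        try (BufferedReader reader = Files.newBufferedReader(bigFile)) {
            BufferedWriter writer = null;
            String line;
            int linesInCurrentFile = 0;
            while ((line = reader.readLine()) != null) {
                // Open the first file, or roll over to a new one at the threshold
                if (writer == null || linesInCurrentFile == maxLinesPerFile) {
                    if (writer != null) {
                        writer.close();
                    }
                    Path part = outputDir.resolve("part-" + parts.size() + ".csv");
                    parts.add(part);
                    writer = Files.newBufferedWriter(part);
                    linesInCurrentFile = 0;
                }
                writer.write(line);
                writer.newLine();
                linesInCurrentFile++;
            }
            if (writer != null) {
                writer.close();
            }
        }
        return parts;
    }
}
```

Swapping the file-name scheme for a timestamp, as asked in the question, only changes the `resolve(...)` line.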

There is a Java library, OpenCSV, that makes dealing with CSV files easy, and you might like to use it, depending on the complexity involved:

<dependency>
        <groupId>com.opencsv</groupId>
        <artifactId>opencsv</artifactId>
        <version>4.6</version>
</dependency>
Sabir Khan

I would use a command line utility like the split command (or equivalent) or try to do it with plain Java (See Java - Read file and split into multiple files).

But if you really want to do it with Spring Batch, then you can use something like:

import java.time.LocalDateTime;

import org.springframework.batch.core.Job;
import org.springframework.batch.core.JobParameters;
import org.springframework.batch.core.Step;
import org.springframework.batch.core.configuration.annotation.EnableBatchProcessing;
import org.springframework.batch.core.configuration.annotation.JobBuilderFactory;
import org.springframework.batch.core.configuration.annotation.StepBuilderFactory;
import org.springframework.batch.core.launch.JobLauncher;
import org.springframework.batch.item.ExecutionContext;
import org.springframework.batch.item.ItemWriter;
import org.springframework.batch.item.file.FlatFileItemReader;
import org.springframework.batch.item.file.FlatFileItemWriter;
import org.springframework.batch.item.file.builder.FlatFileItemReaderBuilder;
import org.springframework.batch.item.file.mapping.PassThroughLineMapper;
import org.springframework.batch.item.file.transform.PassThroughLineAggregator;
import org.springframework.context.ApplicationContext;
import org.springframework.context.annotation.AnnotationConfigApplicationContext;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;
import org.springframework.core.io.FileSystemResource;

@Configuration
@EnableBatchProcessing
public class MyJob {

    private final JobBuilderFactory jobBuilderFactory;

    private final StepBuilderFactory stepBuilderFactory;

    public MyJob(JobBuilderFactory jobBuilderFactory, StepBuilderFactory stepBuilderFactory) {
        this.jobBuilderFactory = jobBuilderFactory;
        this.stepBuilderFactory = stepBuilderFactory;
    }

    @Bean
    public FlatFileItemReader<String> itemReader() {
        return new FlatFileItemReaderBuilder<String>()
                .name("flatFileReader")
                .resource(new FileSystemResource("foos.txt"))
                .lineMapper(new PassThroughLineMapper())
                .build();
    }

    @Bean
    public ItemWriter<String> itemWriter() {
        final FlatFileItemWriter<String> writer = new FlatFileItemWriter<>();
        writer.setLineAggregator(new PassThroughLineAggregator<>());
        writer.setName("chunkFileItemWriter");
        return items -> {
            writer.setResource(new FileSystemResource("foos" + getTimestamp() + ".txt"));
            writer.open(new ExecutionContext());
            writer.write(items);
            writer.close();
        };
    }

    private String getTimestamp() {
        // TODO tested on unix/linux systems, update as needed to not contain illegal characters for a file name on MS windows
        return LocalDateTime.now().toString();
    }

    @Bean
    public Step step() {
        return stepBuilderFactory.get("step")
                .<String, String>chunk(3)
                .reader(itemReader())
                .writer(itemWriter())
                .build();
    }

    @Bean
    public Job job() {
        return jobBuilderFactory.get("job")
                .start(step())
                .build();
    }

    public static void main(String[] args) throws Exception {
        ApplicationContext context = new AnnotationConfigApplicationContext(MyJob.class);
        JobLauncher jobLauncher = context.getBean(JobLauncher.class);
        Job job = context.getBean(Job.class);
        jobLauncher.run(job, new JobParameters());
    }

}

The file foos.txt is the following:

foo1
foo2
foo3
foo4
foo5
foo6

The example will write each chunk in a separate file with a timestamp:

File1 foos2019-11-28T09:23:47.769.txt:

foo1
foo2
foo3

File2 foos2019-11-28T09:23:47.779.txt:

foo4
foo5
foo6

I think it's better to use a sequence number instead of a timestamp, by the way.

NB: I would not care much about restartability for such a use case.
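Regarding the Windows caveat in the `getTimestamp()` TODO above, one possible fix (my own suggestion, not part of the original answer) is to format the timestamp with a pattern that avoids the colons produced by `LocalDateTime.toString()`:

```java
import java.time.LocalDateTime;
import java.time.format.DateTimeFormatter;

public class SafeTimestamp {

    // yyyyMMdd-HHmmssSSS contains no characters that are illegal in a
    // Windows file name (unlike the ':' in LocalDateTime.toString()).
    private static final DateTimeFormatter FORMAT =
            DateTimeFormatter.ofPattern("yyyyMMdd-HHmmssSSS");

    public static String now() {
        return LocalDateTime.now().format(FORMAT);
    }
}
```

This would produce names like `foos20191128-092347769.txt` instead of `foos2019-11-28T09:23:47.769.txt`.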

Mahmoud Ben Hassine
  • Hi, just a quick question - what is the purpose of `ExecutionContext` passed to call - `writer.open(new ExecutionContext())` ? can this be reused or need to be created new every time? Also, there is problem on Windows for file name created by call - `LocalDateTime.now()` as it contains not allowed character colon , `java.nio.file.InvalidPathException: Illegal char <:> at index` – Sabir Khan Nov 27 '19 at 13:55
  • the execution context can be reused. And for the timestamp, yes it might be an issue if it results in a value containing an illegal character for a file name on windows (I tested it on mac os before posting the answer). I updated the answer accordingly and left this detail to the user. – Mahmoud Ben Hassine Nov 27 '19 at 14:11
  • Hi @MahmoudBenHassine i am getting error of Caused by: java.lang.IllegalArgumentException: The resource must be set. And actually for now I just did some solution using ClassifierCompositeItemWriter from here https://stackoverflow.com/questions/15974458/spring-batch-writing-data-to-multiple-files-with-dynamic-file-name, – Lp. Don Nov 28 '19 at 02:01
  • @Lp.Don No you shouldn't, make sure your resource is not `null`. I updated the answer with a complete example. This kind of step can be used as a preparatory task for a partitioned step (This is actually a portable way of splitting a file compared to using a `SystemCommandTasklet` with a non portable OS specific command). @Sabir Khan I should mention that if you want to re-use the execution context, you need to clear it after each iteration, but that's not a big deal, it will be GCed anyway. – Mahmoud Ben Hassine Nov 28 '19 at 08:42

Use a `Partitioner` in Spring Batch. For implementation details, please check

  1. this article
  2. or this

and check the API documentation here
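To illustrate the idea behind partitioning without pulling in the Spring Batch API, here is a framework-free sketch: each partition gets its own small context naming the split file it should process, which mirrors what a `Partitioner` builds so that worker steps can run in parallel. The class and method names here are illustrative, not Spring Batch's own:

```java
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Illustrative only (not the Spring Batch API): build one partition
// context per split file, the way a Partitioner's partition(gridSize)
// assigns an input resource to each worker step.
public class FilePartitioner {

    public static Map<String, Map<String, String>> partition(List<String> splitFiles) {
        Map<String, Map<String, String>> partitions = new LinkedHashMap<>();
        for (int i = 0; i < splitFiles.size(); i++) {
            Map<String, String> context = new HashMap<>();
            // Each worker step would read this key to find its input file
            context.put("fileName", splitFiles.get(i));
            partitions.put("partition" + i, context);
        }
        return partitions;
    }
}
```

In real Spring Batch code, each of these contexts would be an `ExecutionContext` returned from `Partitioner.partition(int gridSize)`, and the worker step's reader would be step-scoped on the `fileName` value.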

Sandeep Kumar