9

I am trying to use the new Java 8 Streams API (to which I am a complete newbie) to parse a CSV file for a particular row (the one with 'Neda' in the name column). Using the following article for motivation, I modified the code and fixed some errors so that it could parse a file containing three columns: 'name', 'age' and 'height'.

name,age,height
Marianne,12,61
Julie,13,73
Neda,14,66
Julia,15,62
Maryam,18,70

The parsing code is as follows:

@Override
public void init() throws Exception {
    Map<String, String> params = getParameters().getNamed();
    if (params.containsKey("csvfile")) {
        Path path = Paths.get(params.get("csvfile"));
        if (Files.exists(path)) {
            // use the new java 8 streams api to read the CSV column headings
            Stream<String> lines = Files.lines(path);
            List<String> columns = lines
                .findFirst()
                .map((line) -> Arrays.asList(line.split(",")))
                .get();
            columns.forEach((l) -> System.out.println(l));
            // find the relevant sections from the CSV file
            // we are only interested in the row with Neda's name,
            // so we need to know the index positions of the columns
            int nameIndex = columns.indexOf("name");
            int ageIndex = columns.indexOf("age");
            int heightIndex = columns.indexOf("height");
            // we have to re-read the csv file to extract the values
            lines = Files.lines(path);
            List<List<String>> values = lines
                .skip(1)
                .map((line) -> Arrays.asList(line.split(",")))
                .collect(Collectors.toList());
            values.forEach((l) -> System.out.println(l));
        }
    }
}

Is there any way to avoid re-reading the file following the extraction of the header line? Although this is a very small example file, I will be applying this logic to a large CSV file.

Is there a technique using the streams API to create a map between the extracted column names (from the first scan of the file) and the values in the remaining rows?

How can I return just one row in the form of List<String> (instead of a List<List<String>> containing all the rows)? I would prefer to find the row as a mapping between the column names and their corresponding values (a bit like a result set in JDBC). I see a Collectors.mapMerger function that might be helpful here, but I have no idea how to use it.

johnco3

4 Answers

13

Use a BufferedReader explicitly:

List<String> columns;
List<List<String>> values;
try (BufferedReader br = Files.newBufferedReader(path)) {
    // read the header line first, then stream the remaining lines
    String firstLine = br.readLine();
    if (firstLine == null) throw new IOException("empty file");
    columns = Arrays.asList(firstLine.split(","));
    values = br.lines()
        .map(line -> Arrays.asList(line.split(",")))
        .collect(Collectors.toList());
}

Files.lines(…) also resorts to BufferedReader.lines(…) under the hood. The only difference is that Files.lines configures the stream so that closing the stream closes the reader, which we don't need here, as the explicit try(…) statement already ensures that the BufferedReader is closed.
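
For illustration, if you did use Files.lines directly, the stream itself should be the managed resource, so that closing it closes the underlying reader (a sketch, not part of the original answer):

try (Stream<String> lines = Files.lines(path)) {
    // closing the stream closes the reader that Files.lines opened
    lines.forEach(System.out::println);
}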

Note that there is no guarantee about the state of the reader after the stream returned by lines() has been processed, but we can safely read lines before performing the stream operation.
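
Building on the snippet above, the single row the question asks for can then be pulled out of values (my addition; findFirst and the exception are illustrative):

// locate the "name" column, then keep the first row whose name is Neda
int nameIndex = columns.indexOf("name");
List<String> neda = values.stream()
    .filter(row -> row.get(nameIndex).equals("Neda"))
    .findFirst()
    .orElseThrow(() -> new IllegalStateException("no row named Neda"));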

Holger
6

First, your concern that this code reads the file twice is unfounded. Actually, Files.lines returns a Stream of the lines that is lazily populated. So the first part of the code only reads the first line, and the second part reads the rest (it does read the first line a second time, though, even if it is ignored). Quoting its documentation:

Read all lines from a file as a Stream. Unlike readAllLines, this method does not read all lines into a List, but instead populates lazily as the stream is consumed.

On to your second concern about returning just a single row. In functional programming, what you are trying to do is called filtering. The Stream API provides such a method: Stream.filter. It takes a Predicate as argument, which is a function that returns true for every item that should be kept, and false otherwise.

In this case, we want a Predicate that would return true when the name is equal to "Neda". This could be written as the lambda expression s -> s.equals("Neda").
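
Since each line has already been split into a List<String> in your code, the same test written against the split row would look like this (a sketch; the variable name is mine):

// the name is the first column, so the predicate tests element 0 of each row
Predicate<List<String>> isNeda = row -> row.get(0).equals("Neda");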

So in the second part of your code, you could have:

lines = Files.lines(path);
List<List<String>> values = lines
            .skip(1)
            .map(line -> Arrays.asList(line.split(",")))
            .filter(list -> list.get(0).equals("Neda")) // keep only items where the name is "Neda"
            .collect(Collectors.toList());

Note however that this does not ensure that there is only a single item where the name is "Neda", it collects all possible items into a List<List<String>>. You could add some logic to find the first item or throw an exception if no items are found, depending on your business requirement.
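
For example, one way to reduce the result to a single List<String>, or fail fast when the row is missing (a sketch, assuming exactly one matching row is expected):

List<String> neda;
try (Stream<String> stream = Files.lines(path)) {
    // take the first matching row, or throw if none exists
    neda = stream
        .skip(1)
        .map(line -> Arrays.asList(line.split(",")))
        .filter(list -> list.get(0).equals("Neda"))
        .findFirst()
        .orElseThrow(() -> new IllegalStateException("no row named Neda"));
}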


Note also that calling Files.lines(path) twice can be avoided by using a BufferedReader directly, as in @Holger's answer.

Tunaki
  • @Tunaki I found the filter very useful - thanks. Do you know how I could return just a List<String> instead of a List<List<String>>, given that I am explicitly filtering a single row - better still, a mapping between the column names and the values of this filtered row? – johnco3 Jan 06 '16 at 18:52
  • @johnco3 This depends on how many rows will have the name "Neda". After `.filter(…)` you can call `findFirst()` to return the first item, for example. You will have a `List<String>` then – Tunaki Jan 06 '16 at 19:00
  • @Tunaki How does it read the first line twice? skip(1) should skip the first line and move on with the rest of the lines. I am not sure I understand your comment about this. Thanks! – TriCore Jan 06 '16 at 19:10
  • @TriCore: `skip` ensures that items aren't processed by the subsequent stream operations, but it can't avoid that the source has to read/generate them first, before they can get skipped. A `BufferedReader` has to read the first line to know where the second line starts; there is no way around this. – Holger Jan 06 '16 at 19:12
  • @TriCore You might also want to read [that answer](http://stackoverflow.com/a/32414480/1743880) (and [that one](http://stackoverflow.com/a/32414407/1743880)) as it explains well what happens with limit and skip. – Tunaki Jan 06 '16 at 19:17
  • @Tunaki findFirst().get() returns the List<String>, thanks! Do you perchance know how to make a mapping between the headers and the values as part of this functional stream mapping? I already have a list of headers, so presumably there is some lambda magic I could apply, like Collectors.toMap - I cannot figure out the syntax – johnco3 Jan 06 '16 at 19:57
  • @Tunaki - I was trying something like Map<String, String> map = lines.skip(1).map((line) -> Arrays.asList(line.split(","))).filter(list -> list.get(0).equals("Neda")).collect(Collectors.toMap(???,????)); I cannot figure out the 'toMap()' parameters that would make a JDBC-resultset-like object that I could query, or perhaps as an alternative I should be using the Collectors.mapMerger(...) function to merge in the column values. – johnco3 Jan 06 '16 at 20:08
  • @johnco3: maybe you want to open a new question? – Holger Jan 06 '16 at 20:25
  • @Holger Probably a good idea, however this is actually part of the original question - last paragraph - the details are starting to evolve in the comments section here though – johnco3 Jan 06 '16 at 20:27
  • @johnco3: it's exactly the evolution of questions in comments that should be avoided. The preferred way on SO is to ask multiple questions, perhaps with links to each other, rather than putting too much into one question. So you had two (or even three) questions in one here. Don't hesitate to split them. – Holger Jan 06 '16 at 20:30
  • @johnco3 I agree with Holger that you should open a new question. To give you a hint though, it is from the list you get that you should create the map (to create the map, you need to iterate over the elements of the list, not over the lines of the file ;) ). – Tunaki Jan 06 '16 at 20:37
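
A minimal sketch of that last hint (my addition; toRowMap is an illustrative name): pair each column name with the value at the same index in the single filtered row.

import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;
import java.util.stream.IntStream;

static Map<String, String> toRowMap(List<String> columns, List<String> row) {
    // zip the two lists by index: column name -> cell value
    return IntStream.range(0, columns.size())
        .boxed()
        .collect(Collectors.toMap(columns::get, row::get));
}
// toRowMap(columns, neda).get("age") -> "14", much like a JDBC result set
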
1

Using a CSV-processing library

The other answers are good, but I recommend using a CSV-processing library to read your input files. As others noted, the CSV format is not as simple as it may seem. To begin with, the values may or may not be wrapped in quote marks. And there are many variations of CSV, such as those used by Postgres, MySQL, Mongo, Microsoft Excel, and so on.

The Java ecosystem offers several such libraries. I use Apache Commons CSV.

The Apache Commons CSV library does not make use of streams. But you have no need for streams in this work if a library does the scut work. The library makes easy work of looping over the rows of the file without loading a large file into memory.

create a map between the extracted column names (in the first scan of the file) to the values in the remaining rows?

Apache Commons CSV does this automatically when you call withHeader.
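
For instance (my sketch; parser is the CSVParser built in the code further below), each CSVRecord can be read by column name, or copied into a header-to-value Map in one call:

for ( CSVRecord record : parser )
{
    String age = record.get( "age" );               // access by column name
    Map<String, String> rowAsMap = record.toMap();  // {name=Neda, age=14, height=66} for Neda's row
}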

return just one row in the form of List<String>

Yes, easy to do.

As you requested, we can fill a List<String> with each of the 3 field values for one particular row. This List acts as a tuple.

List<String> tuple = List.of();  // Our goal is to fill this list with the values from a single row. Initialize to an empty unmodifiable list.

We specify the format we expect of our input file: standard CSV (RFC 4180), with the first row populated by column names.

CSVFormat format = CSVFormat.RFC4180.withHeader();

We specify the file path where to find our input file.

Path path = Path.of("/Users/basilbourque/people.csv");

We use try-with-resources syntax (see Tutorial) to automatically close our parser.

As we read each row, we check for the name being Neda. If found, we fill our tuple List with that row's field values and interrupt the looping. We use List.of to conveniently return a List object of some unknown concrete class that is unmodifiable, meaning you cannot add or remove elements.

try (
        CSVParser parser = CSVParser.parse( path , StandardCharsets.UTF_8 , format ) ;
)
{
    for ( CSVRecord record : parser )
    {
        if ( record.get( "name" ).equals( "Neda" ) )
        {
            tuple = List.of( record.get( "name" ) , record.get( "age" ) , record.get( "height" ) );
            break ;
        }
    }
}
catch ( FileNotFoundException e )
{
    e.printStackTrace();
}
catch ( IOException e )
{
    e.printStackTrace();
}

If we found success, we should see some items in our List.

if ( tuple.isEmpty() )
{
    System.out.println( "Bummer. Failed to report a row for `Neda` name." );
} else
{
    System.out.println( "Success. Found this row for name of `Neda`:" );
    System.out.println( tuple.toString() );
}

When run:

Success. Found this row for name of Neda:

[Neda, 14, 66]

Instead of using a List as a tuple, I suggest you define a Person class to represent this data with proper data types. Our code here would then return a Person instance rather than a List<String>.
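
A sketch of that suggestion (the class shape and the choice of int for age and height are my own illustration):

import org.apache.commons.csv.CSVRecord;

public class Person
{
    private final String name;
    private final int age;
    private final int height;

    public Person ( String name , int age , int height )
    {
        this.name = name;
        this.age = age;
        this.height = height;
    }

    // build a Person from a record parsed with a header row
    public static Person fromRecord ( CSVRecord record )
    {
        return new Person(
                record.get( "name" ) ,
                Integer.parseInt( record.get( "age" ) ) ,
                Integer.parseInt( record.get( "height" ) )
        );
    }

    @Override
    public String toString ( )
    {
        return "Person{ name=" + name + " | age=" + age + " | height=" + height + " }";
    }
}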

Basil Bourque
0

I know I'm responding very late, but maybe it will help someone in the future.

I've made a CSV parser/writer that is easy to use thanks to its builder pattern.

For your case: you can filter the lines you want to parse using

csvLineFilter(Predicate<String>) 

Hope you find it handy; here is the source code: https://github.com/i7paradise/CsvUtils-Java8/

I've included a main class Demo.java to show how it works.