Fastest way to extract uppercase characters in Java

Question

I am currently working with DNA sequences in the form of strings where introns are in lowercase and exons in uppercase characters. The aim of the method is to retrieve the exons in the form of a String as fast as possible.

Example of a sequence :

ATGGATGACAGgtgagaggacactcgggtcccagccccaggctctgccctcaggaagggggtcagctctcaggggcatctccctctcacagcccagccctggggatgatgtgggagccccatttatacacggtgcctccttctctcctagAGCCTACATAG

My first version was using the String replaceAll() method but it was particularly slow:

public String getExons(String sequence) {
    return sequence.replaceAll("[atgc]", "");
}

So I tried a new version which improved performance but remains fairly slow:

public String getExons(String sequence) {
    StringBuilder exonBuilder = new StringBuilder();

    for (int i = 0; i < sequence.length(); i++) {
        char c = sequence.charAt(i);
        if (c == 'A' || c == 'T' || c == 'G' || c == 'C') exonBuilder.append(c);
    }
    return exonBuilder.toString();

Is there another way to do it which would improve performance ?

how should that be "slow" with that less input and that less work to do? — SomeJavaGuy, Sep 02 '16 at 11:29
You can use `new StringBuilder(sequence.length())` to avoid having to resize the internal `char[]`. I can't imagine that'd make too much difference, though, with an input that small. — Andy Turner, Sep 02 '16 at 11:29
You may try using a char array instead of the StringBuilder. Initialize it to the size of the sequence and then manipulate the index to get the position where each found letter will be stored. — Slimu, Sep 02 '16 at 11:30
And you can always split the input string in multiple parts and use parallelization. — B.Gen.Jack.O.Neill, Sep 02 '16 at 11:31
@AndyTurner, Specifying the capacity is a good idea, but it would be a waste of space if you just use sequence.length(). Thibault Robin, if you have an estimate of the number of uppercase in the sequence - use this value as initial capacity. — Egor, Sep 02 '16 at 11:31
@KevinEsche I chose a small sequence as example for readability, but gene sequences can go up to ~5000-10000 characters and above. Moreover this method has to be called at least once per genre, and in some cases I have more than 100000 genes to deal with. — Thibault Robin, Sep 02 '16 at 11:34
If you really want to go all out on speed, don't use strings at all. One nucleotide pair can be encoded in four bits: two bits for a/g/t/c, one for intron/exon status, and the last one just for padding, so that a byte contains exactly two nucleotide pairs. Carrying all the baggage of processing Unicode characters for something that has so few values is not worth it. — biziclop, Sep 02 '16 at 11:45
Little bit off-topic, but if you really need the speed and you work with very, very large strings, you can always take a look at using GPU. There are some librarries to use with Java as jocl for OpenCL. You could even use some OpenGL wrapper and use shaders directly. — B.Gen.Jack.O.Neill, Sep 02 '16 at 11:45
I suppose that solution from myselfmayur below might be fastest (you can also presize StringBuilder for small saving) for problem as given. But I wonder if real savings can be done elsewhere - do you really need to pass around these sequences as Strings? You might want to invest into specialized data structure (as it is just few bits of information per pair, rather than 16-bit char). It might turn out to be beneficial later, when you start to do less trivial operations on sequences, as you could use some bitwise manipulation instead of doing char-by-char iterations. — Artur Biesiadowski, Sep 02 '16 at 11:47
The other question is whether you really need to create a copy of the sub-sequences. If the sub-sequences you need are fairly long and there isn't many of them per gene, you can just record where each sub-sequence starts/ends. — biziclop, Sep 02 '16 at 11:49
@biziclop - you have beaten me with that suggestion by few seconds;) Anyway, as far as exact representation is concerned, I would probably go for 2-bits per pair (for 4 per byte) and have separate bit-by-bit array for intron/exon status. Many operations will work only on one part of this data (and it avoids 1 bit of padding per pair) — Artur Biesiadowski, Sep 02 '16 at 11:51
@ArturBiesiadowski Yes, there are many options for a compact data structure, and which one is best will largely depend on the kind of operations you want to perform. I just gave a simple example of one such possible structure. As you say, if most of your operations are on just the letters or just the status, storing them separately is better. But if you always process them together, you'll also want to keep the information close together in memory. — biziclop, Sep 02 '16 at 11:56
When the question is about *speed*, why do some people suggest to use *space* efficient data structures, like bit encoding in a byte? You do notice that it's not really cheap to extract bits from one byte and insert them into another? Also, no sense in assuming that byte-wide operations are *any* faster than 16 or 32 bit wide ones on your average 32 or 64 bit CPU... — JimmyB, Sep 02 '16 at 12:13
@JimmyB Because when you've shrunk your data to half the size (for example), you can fit twice as much in the super fast L1/L2/L3 processor caches, making processing them blazing fast. This is the part where it's okay start micro-optimizing, if you know what you're doing. — Kayaman, Sep 02 '16 at 12:40
@JimmyB You don't have to do it byte-by-byte, the idea is exactly to do it on several bytes in one go. — biziclop, Sep 02 '16 at 13:33
As above, to be really fast don't use String at all. Another tip: Instead of reading the data into a String with a Reader, read it into a byte[] with an InputStream. The difference? char is 16 bits, and a byte is 8 bits, you don't need the extra 8 bits for unicode, asuming your input data is ASCII. Consider also a 'direct' ByteBuffer - which is fast in Java. — user2800708, Sep 02 '16 at 13:43
@biziclop Doesn't make a difference. Unless you can use some smart SIMD instructions, there's not much you can do to simplify transforming an input of, say, 8*4=32 bits into an output of 0...8*4 bits where the length of the output depends on the input data. — JimmyB, Sep 02 '16 at 21:09
@kayaman Valid point. I don't believe it would have a significant effect in the OP's context (5-10k chars per object), but maybe I'm giving it a try when I have time. — JimmyB, Sep 02 '16 at 21:16
If you have thousands of different strings to filter, you really should consider parallizing the task. But not too much. Creating 100k individual tasks to submit to an executor would be a huge overhead. I'd probably create one thread per CPU core and have it iterate over the correspondig fraction of the total number of input strings. That's basically how you'd do it on a GPU too. — JimmyB, Sep 02 '16 at 21:23
Maybe worth exploring, (instead of looking for uppercase lowercase letters) if we know the positions for start end of sequences then get exon-intron positions and substr. Or if positions are unknown, align it to genome, then get exon-intron positions. I can see this can be done using fasta as input using R packages, not sure if there are a packages for Java. — zx8754, Sep 04 '16 at 09:16

vz0 · Accepted Answer · 2016-09-05T06:38:06.190

~~You need to use a char array with double pointer trick.~~ I get this results in my machine:

Edit: updated with warmup phase. Java is OpenJDK 8 from Ubuntu 14 LTS

Edit2: ~~hash table is the fastest by far margin.~~

Edit3: I had a bug in my code. The double pointer trick is the fastest.

GTCtgACgGT
getExons1: 1068
getExons2: 377
getExons3: 313
getExons3b: 251
getExons4: 586
getExons5: 189
getExons6: 671

Edit4: Running the benchmarks with JMH with a 1M DNA String. The result is consistent with my previous benchmark regarding "x better than y" and the worst being the regexp, the best being the double pointer, and the second best being the naive 3B:

Benchmark                  Mode  Cnt    Score   Error  Units
MyBenchmark.benchExons1   thrpt  200   33.659 ± 1.036  ops/s
MyBenchmark.benchExons2   thrpt  200  107.095 ± 4.074  ops/s
MyBenchmark.benchExons3a  thrpt  200  118.543 ± 3.779  ops/s
MyBenchmark.benchExons3b  thrpt  200  163.717 ± 4.602  ops/s
MyBenchmark.benchExons4   thrpt  200   69.942 ± 2.019  ops/s
MyBenchmark.benchExons5   thrpt  200  191.142 ± 5.307  ops/s
MyBenchmark.benchExons6   thrpt  200   57.654 ± 1.963  ops/s

Edit5: With 10 MB Strings:

Benchmark                  Mode  Cnt   Score   Error  Units
MyBenchmark.benchExons1   thrpt  200   4.640 ± 0.068  ops/s
MyBenchmark.benchExons2   thrpt  200  13.451 ± 0.161  ops/s
MyBenchmark.benchExons3a  thrpt  200  15.379 ± 0.232  ops/s
MyBenchmark.benchExons3b  thrpt  200  19.550 ± 0.181  ops/s
MyBenchmark.benchExons4   thrpt  200   8.510 ± 0.147  ops/s
MyBenchmark.benchExons5   thrpt  200  24.343 ± 0.331  ops/s
MyBenchmark.benchExons6   thrpt  200   7.339 ± 0.074  ops/s

The code:

package org.sample;

import org.openjdk.jmh.annotations.Benchmark;
import org.openjdk.jmh.annotations.Scope;
import org.openjdk.jmh.annotations.State;

import java.util.HashMap;
import java.util.Random;

@State(Scope.Thread)
public class MyBenchmark {

    String DNA;

    public MyBenchmark() {
        DNA = buildRandomDNA(1000000);
    }

    static String letters = "ATGCatgc";

    public static String buildRandomDNA(int size) {
        StringBuilder builder = new StringBuilder(size);
        Random r = new Random();

        for (int i = 0; i < size; ++i) {
            builder.append(letters.charAt(r.nextInt(letters.length())));
        }

        return builder.toString();
    }

    @Benchmark
    public void benchExons1() {
        getExons1(DNA);
    }

    @Benchmark
    public void benchExons2() {
        getExons2(DNA);
    }

    @Benchmark
    public void benchExons3a() {
        getExons3a(DNA);
    }

    @Benchmark
    public void benchExons3b() {
        getExons3b(DNA);
    }

    @Benchmark
    public void benchExons4() {
        getExons4(DNA);
    }

    @Benchmark
    public void benchExons5() {
        getExons5(DNA);
    }

    @Benchmark
    public void benchExons6() {
        getExons6(DNA);
    }

    public static String getExons1(String sequence) {
        return sequence.replaceAll("[atgc]", "");
    }

    public static String getExons2(String sequence) {
        StringBuilder exonBuilder = new StringBuilder();

        for (int i = 0; i < sequence.length(); i++) {
            char c = sequence.charAt(i);
            if (c == 'A' || c == 'T' || c == 'G' || c == 'C')
                exonBuilder.append(c);
        }
        return exonBuilder.toString();
    }

    public static String getExons3a(String sequence) {
        StringBuilder exonBuilder = new StringBuilder();

        for (int i = 0; i < sequence.length(); i++) {
            char c = sequence.charAt(i);
            if (c <= 'Z') {
                exonBuilder.append((char) c);
            }
        }

        return exonBuilder.toString();
    }

    public static String getExons3b(String sequence1) {
        char[] sequence = sequence1.toCharArray();
        StringBuilder exonBuilder = new StringBuilder();

        for (int i = 0; i < sequence.length; i++) {
            if (sequence[i] <= 'Z') {
                exonBuilder.append(sequence[i]);
            }
        }

        return exonBuilder.toString();
    }

    public static HashMap<String, String> M = new HashMap<String, String>();

    public static void buildTable() {
        for (int a = 0; a < letters.length(); ++a) {
            for (int b = 0; b < letters.length(); ++b) {
                for (int c = 0; c < letters.length(); ++c) {
                    for (int d = 0; d < letters.length(); ++d) {
                        String key = "" + letters.charAt(a) + letters.charAt(b) + letters.charAt(c) + letters.charAt(d);
                        M.put(key, getExons1(key));
                    }
                }
            }
        }
    }

    public static String getExons4(String sequence1) {
        char[] sequence = sequence1.toCharArray();
        StringBuilder exonBuilder = new StringBuilder();

        for (int i = 0; i < sequence.length; i += 4) {
            exonBuilder.append(M.get(new String(sequence, i, 4)));
        }

        return exonBuilder.toString();
    }

    public static String getExons5(String sequence1) {
        char[] sequence = sequence1.toCharArray();
        int p = 0;

        for (int i = 0; i < sequence.length; i++) {
            if (sequence[i] <= 'Z') {
                sequence[p] = sequence[i];
                ++p;
            }
        }

        return new String(sequence, 0, p);
    }

    public static int dnatoint(char[] s, int start, int len) {
        int key = 0;
        for (; len > 0; len--, start++) {
            switch (s[start]) {
            case 'A': key = (key << 3) | 0; break;
            case 'C': key = (key << 3) | 1; break;
            case 'G': key = (key << 3) | 2; break;
            case 'T': key = (key << 3) | 3; break;
            case 'a': key = (key << 3) | 4; break;
            case 'c': key = (key << 3) | 5; break;
            case 'g': key = (key << 3) | 6; break;
            case 't': key = (key << 3) | 7; break;
            }
        }
        return key;
    }

    public static String[] M2 = new String[8*8*8*8];

    public static void buildTable2() {
        for (int a = 0; a < letters.length(); ++a) {
            for (int b = 0; b < letters.length(); ++b) {
                for (int c = 0; c < letters.length(); ++c) {
                    for (int d = 0; d < letters.length(); ++d) {
                        String key = "" + letters.charAt(a) + letters.charAt(b) + letters.charAt(c) + letters.charAt(d);
                        M2[dnatoint(key.toCharArray(), 0, 4)] = getExons1(key);
                    }
                }
            }
        }
    }

    public static String getExons6(String sequence1) {
        char[] sequence = sequence1.toCharArray();
        StringBuilder exonBuilder = new StringBuilder();

        assert (sequence.length % 4) == 0;

        for (int i = 0; i < sequence.length; i += 4) {
            exonBuilder.append(M2[dnatoint(sequence, i, 4)]);
        }

        return exonBuilder.toString();
    }

    static {
        buildTable();
        buildTable2();
    }

    //@Benchmark
    public void testMethod() {
        // This is a demo/sample template for building your JMH benchmarks. Edit as needed.
        // Put your benchmark code here.
    }

}

Nice, although the [micro-benchmark isn't quite correct](http://stackoverflow.com/questions/504103/how-do-i-write-a-correct-micro-benchmark-in-java). — Kayaman, Sep 02 '16 at 13:14
Your benchmark shows misleading results because it collects many [common benchmarking pitfalls](http://www.oracle.com/technetwork/articles/java/architect-benchmarking-2266277.html): 1) it incorporates several benchmarks in a single method; 2) it measures an [OSR stub](http://stackoverflow.com/questions/9105505/differences-between-just-in-time-compilation-and-on-stack-replacement) instead of the final compiled version; 3) it does not consume results, etc. — apangin, Sep 02 '16 at 13:36
Thank you very much for your answer! I tried the getExons5 method and indeed it improves quite a lot the program execution speed. I don't think it is possible to make it quicker without changing data structure as some people suggested. — Thibault Robin, Sep 02 '16 at 13:59
@ThibaultRobin Yes, you should pre-process the data and save it as a new file, changing the data structure. 10k x 100k is 1GB of DNA. That's cheap storage nowdays. — vz0, Sep 02 '16 at 14:02
I've re-run the benchmarks with JMH, with a 1M string instead of the 10M original. I'll update once the 10M is finished. With 1M the whole string fits in cache. — vz0, Sep 03 '16 at 05:48

score 5 · Answer 2 · answered Sep 02 '16 at 11:36

5

Use below:

 public String getExons(String sequence) {
    StringBuilder exonBuilder = new StringBuilder();

    for (int i = 0; i < sequence.length(); i++) {
        char c = sequence.charAt(i);
        if((int)c>=65 && (int)c <=90){
           exonBuilder.append((char)c);
         }
     }

    return exonBuilder.toString();

answered Sep 02 '16 at 11:36

Mayur Maheshwari

160
1
9

7

There's no actual need to cast to `int`: `c >= 65` is equivalent. Also, you can simply use `c >= 'A'`, to remove your magic numbers. – Andy Turner Sep 02 '16 at 11:44
3

One comparision will be enough. In `(char)c` cast is not necessary. – talex Sep 02 '16 at 11:45
1

It's probably a good idea to pass expected size to `StringBuilder` constructor. E.g. `sequence.length ()` or something. – Sep 02 '16 at 11:50
1). You need only one check: `c<=90`. 2). Iterating over char array created from `String` (via `toCharArray`) will outperform current approach. 3). Writing to a huge `char[]` and then creating a `String` from it will outperform current approach. If you apply the changes I've mentioned you will receive ~x2 performance from the current one. – user3707125 Sep 02 '16 at 12:07
1

I would be very careful considering optimalization speed gain estimates. Java and even CPU have many internal mechanisms that reduce complexity and could easily neglect some tweaks by using precompiled code, branch prediction etc. – B.Gen.Jack.O.Neill Sep 02 '16 at 12:37

Jean-François Savard · Answer 3 · 2016-09-05T11:56:51.490

5

Here an alternative which is, in my opinion more concise, and as fast as others suggested here.

sequence.chars()
    .filter(c -> c >= 65 && c <= 90)
    .collect(StringBuilder::new, 
         (StringBuilder sb, int c) -> sb.append((char) c), 
         StringBuilder::append);

edited Sep 05 '16 at 11:56

answered Sep 02 '16 at 15:40

Jean-François Savard

20,626
7
49
76

I don't think we're looking for opinions here. We're looking for measurable facts. – Kayaman Sep 05 '16 at 08:44
1

@Kayaman Obviously as this would be against SO's rules. In fact, this code is as fast as other loops used here. I stated that the code is more concise in my opinion specially so that users don't take this for cash and use what they do find more readable. – Jean-François Savard Sep 05 '16 at 11:56

score 1 · Answer 4 · answered Sep 02 '16 at 13:21

1

public static void main(String[] args) {
    String str = "ATGGATGACAGgtgagaggacactcgggtcccagccccaggctctgccctcaggaagggggtcagctctcaggggcatctccctctcacagcccagccctggggatgatgtgggagccccatttatacacggtgcctccttctctcctagAGCCTACATAG";
    byte[] bytes = str.getBytes();

    for(int i=0; i<str.length(); i++) {
        if (str.charAt(i) >= 90) {
            bytes[i] = (byte) (str.charAt(i) - 32);
        }
    }

    System.out.println(new String(bytes));
}

answered Sep 02 '16 at 13:21

Liping Huang

4,378
4
29
46

This doesn't seem to fulfil the requirements of the asker. – Kayaman Sep 02 '16 at 13:29
@Kayaman your means only replace the 'a', 't', 'g', 'c'? so just compare the int valule, like `str.charAt(i) == 97 || str.charAt(i) == 116 || str.charAt(i) == 103 || str.charAt(i) == 99` – Liping Huang Sep 02 '16 at 13:33
No. Your solution doesn't: extract uppercase characters as a `String`. – Kayaman Sep 02 '16 at 13:36

Fastest way to extract uppercase characters in Java

4 Answers4