I am looking for an example of applying a regular expression to a Java I/O stream without simply converting the stream to a string, because I would like to preserve binary data. Most of the examples on the Internet focus on text data...
Viewed 6,764 times
8

Alan Moore

McGovernTheory
-
What are you looking to do? Reject data that doesn't match the regexp? And what do you want to match on if you're not interested in strings? Some clarification would be good. – Brian Agnew Apr 04 '09 at 11:26
-
Just for clarification: A conversion to characters and back to binary data may have a performance impact but not a single byte will be lost due to the conversion. – rwitzel Sep 11 '13 at 20:25
-
Possible duplicate of [Performing regex on a stream](http://stackoverflow.com/questions/3013669/performing-regex-on-a-stream) – Kalle Richter Oct 29 '14 at 01:50
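As a side note on the comment above about lossless conversion: whether bytes survive a trip through `String` depends on the charset. The sketch below is only an illustration (the class and method names are made up for this example). It checks that ISO-8859-1 maps every byte value to exactly one char and back, while UTF-8 replaces invalid byte sequences and therefore does not round-trip arbitrary binary data.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

/** Hypothetical helper for illustration; not part of any library. */
final class RoundTripCheck {

    /** Decodes the bytes with the given charset, re-encodes them, and compares with the input. */
    static boolean survivesRoundTrip(byte[] original, Charset charset) {
        byte[] back = new String(original, charset).getBytes(charset);
        return Arrays.equals(original, back);
    }

    /** Returns an array holding every possible byte value once (0x00 through 0xFF). */
    static byte[] allByteValues() {
        byte[] all = new byte[256];
        for (int i = 0; i < 256; i++) {
            all[i] = (byte) i;
        }
        return all;
    }

    public static void main(String[] args) {
        byte[] all = allByteValues();
        System.out.println("ISO-8859-1 lossless: " + survivesRoundTrip(all, StandardCharsets.ISO_8859_1));
        System.out.println("UTF-8 lossless:      " + survivesRoundTrip(all, StandardCharsets.UTF_8));
    }
}
```

So if you do go through a `String`, decoding with ISO-8859-1 is the choice that keeps byte offsets and values intact.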
4 Answers
9
The needed functionality is not present in the Java standard library. You will have to use Jakarta Regexp, specifically its StreamCharacterIterator class, which wraps an InputStream for use in regexp operations.
If you want to use the standard regular expression package, I would suggest taking the source of that class (here) and changing its contract to implement CharSequence instead of CharacterIterator.
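To give a rough idea of the CharSequence route, here is a minimal sketch. It is not the Jakarta Regexp source: `ByteCharSequence` and `indexOf` are hypothetical names, and the wrapper sits on top of a `byte[]` that has already been read, so it sidesteps rather than solves the question of what `length()` should return for an open-ended stream. Each byte is exposed as one char in the range 0 to 255, so `java.util.regex` can match it with `\xHH` escapes.

```java
import java.nio.charset.StandardCharsets;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical wrapper: presents a byte[] as a CharSequence, one char (0-255) per byte. */
final class ByteCharSequence implements CharSequence {
    private final byte[] data;
    private final int start; // inclusive
    private final int end;   // exclusive

    ByteCharSequence(byte[] data) {
        this(data, 0, data.length);
    }

    private ByteCharSequence(byte[] data, int start, int end) {
        this.data = data;
        this.start = start;
        this.end = end;
    }

    @Override
    public int length() {
        return end - start;
    }

    @Override
    public char charAt(int index) {
        if (index < 0 || index >= length()) {
            throw new IndexOutOfBoundsException("index " + index);
        }
        // Mask to avoid sign extension: byte 0x89 becomes char 0x0089, not 0xFF89.
        return (char) (data[start + index] & 0xFF);
    }

    @Override
    public CharSequence subSequence(int from, int to) {
        if (from < 0 || to > length() || from > to) {
            throw new IndexOutOfBoundsException(from + ".." + to);
        }
        // Shares the underlying array; no bytes are copied.
        return new ByteCharSequence(data, start + from, start + to);
    }

    @Override
    public String toString() {
        return new String(data, start, end - start, StandardCharsets.ISO_8859_1);
    }

    /** Returns the byte offset of the first match, or -1 if there is none. */
    static int indexOf(Pattern pattern, byte[] data) {
        Matcher m = pattern.matcher(new ByteCharSequence(data));
        return m.find() ? m.start() : -1;
    }
}
```

For example, `ByteCharSequence.indexOf(Pattern.compile("\\x89PNG\\r\\n\\x1a\\n"), bytes)` would look for the PNG signature without ever building a String from the data.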

HMM
-
One issue with implementing CharSequence is that the interface requires the class to implement 'public int length()'. If you're reading from a stream, then you won't know the length and won't be able to return an answer to the regex engine. – monkeysplayingpingpong Jan 10 '13 at 14:50
0
Try Ragel, a regular-expression tool with transition callbacks.
It can be applied to streams and chunks.

DenisKolodin
0
Convert the stream to a byte array.
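A minimal sketch of what that could look like, assuming Java 9 or later (for `InputStream.readAllBytes()`) and input small enough to fit in memory. The class and method names are hypothetical. The bytes are decoded with ISO-8859-1 purely so that each byte becomes exactly one char; match positions are therefore byte offsets.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper for illustration; not part of any library. */
final class ByteArrayRegex {

    /** Reads the whole stream into memory and returns the byte offset of every match. */
    static List<Integer> findOffsets(InputStream in, Pattern pattern) {
        byte[] bytes;
        try {
            bytes = in.readAllBytes(); // Java 9+; buffers the entire stream
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        // ISO-8859-1 maps byte N to char N, so char indices equal byte offsets.
        String text = new String(bytes, StandardCharsets.ISO_8859_1);
        List<Integer> offsets = new ArrayList<>();
        Matcher m = pattern.matcher(text);
        while (m.find()) {
            offsets.add(m.start());
        }
        return offsets;
    }
}
```

A pattern such as `Pattern.compile("\\xCA\\xFE\\xBA\\xBE")` can then be used to locate a magic number in the raw bytes.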

tpdi
-
It should be mentioned that this only makes sense if the input can be loaded into memory, in terms of both its size and the time needed to load it! That means you need to know the length of the data provided by the stream in order to write a reliable program. Knowing the input length of a stream contradicts its purpose of providing potentially endless data! – Kalle Richter Oct 29 '14 at 01:34
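One way around the memory concern, if you can bound the length of a match, is to scan the stream in chunks and carry the tail of each chunk into the next window. The sketch below is an illustration under stated assumptions, not a general solution: the names are hypothetical, it requires Java 9+ (`readNBytes`), it assumes no match is longer than `maxMatchLength` bytes, and it is only reliable for patterns without anchors or lookaround whose matches do not depend on data beyond the window (fixed-length byte signatures are the safe case).

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper for illustration; not part of any library. */
final class StreamingByteRegex {

    /**
     * Returns the stream offset of the first match, or -1 if the stream ends without one.
     * Assumption: no match is longer than maxMatchLength bytes.
     */
    static long findFirst(InputStream in, Pattern pattern, int maxMatchLength, int chunkSize) {
        if (maxMatchLength < 1 || chunkSize < 1) {
            throw new IllegalArgumentException("maxMatchLength and chunkSize must be positive");
        }
        int overlap = maxMatchLength - 1;   // bytes carried over between windows
        byte[] window = new byte[overlap + chunkSize];
        int carried = 0;                    // bytes at the front kept from the previous window
        long windowOffset = 0;              // stream offset of window[0]
        try {
            while (true) {
                int read = in.readNBytes(window, carried, chunkSize);
                if (read == 0) {
                    return -1;              // end of stream, nothing new to search
                }
                int filled = carried + read;
                // ISO-8859-1 maps each byte to one char, so indices are byte offsets.
                String text = new String(window, 0, filled, StandardCharsets.ISO_8859_1);
                Matcher m = pattern.matcher(text);
                if (m.find()) {
                    return windowOffset + m.start();
                }
                // Keep the tail so a match straddling the chunk boundary is still seen.
                int keep = Math.min(overlap, filled);
                System.arraycopy(window, filled - keep, window, 0, keep);
                windowOffset += filled - keep;
                carried = keep;
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

Memory use is bounded by `chunkSize + maxMatchLength - 1` bytes regardless of how long the stream is, which is the trade-off against the read-everything approach.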
-2
Regex operations must be performed on strings, which are encoded bytes of binary data. You can't perform regex operations on bytes of data when you have no idea what they represent.

Yuval Adam
-
-1 I disagree. There is no reason why you cannot apply regular expressions to binary data. Binary data does not mean you have no idea what it represents. – HMM Apr 04 '09 at 11:47
-
Supposedly, you could take a stream of 0's and 1's and perform regex on it. However, none of the existing Java APIs give you access to that raw stream without converting it to something more meaningful. – Yuval Adam Apr 04 '09 at 12:00
-
+1, agree. Applying a regexp on binary data does not make sense. Regexps are fundamentally geared towards Strings, they're defined using Strings, so you'll always be using a string encoding, either explicitly or implicitly. – Michael Borgwardt Apr 04 '09 at 12:30
-
I'm not voting up or down, but suppose you had a "binary" protocol like ASN.1 or Java serialization. It would make sense to look for regular expressions in such a string of bytes. – erickson Apr 04 '09 at 19:16
-
There is a danger that some portion of the binary data might match your regexp by coincidence. In which case you may end up making a bogus match, or corrupting the binary data. Depending on your subject data and regexp, you may be able to discard such concerns. But in the general case, binary data can contain strings which do not actually represent strings, implying a risk of false matches. That is why it would be better practice to separate the data first, and why a truly general solution does not exist. Having said that, I upvoted the other answer, because it helps the OP more. ;) – joeytwiddle Jul 19 '09 at 14:12
-
Applying a regex on binary data makes sense, for example, in network protocols. Binary data just means that you have "no encoding" or "no character set". This requires, however, that both your input data and your regex are binary data. – kristianlm Sep 09 '13 at 18:49