3

What is the fastest way to check if a file contains a certain string or number?

jzd
Fseee

4 Answers

5

Have a look at the Scanner class, which ships with the JDK (see the official documentation). It lets you skip parts of the input (in this case, a text file) and match against a regular expression of your choice. I'm not sure it is the most efficient way, but it is certainly simple. You might also take a look at this example, which will help you get started.
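
To make the Scanner suggestion concrete, here is a minimal sketch. The class name `ScannerSearch` and the method `contains` are made up for illustration, and the file is assumed to be UTF-8. It relies on `Scanner.findWithinHorizon` with a horizon of 0, which searches the rest of the input without regard to delimiters.

```java
import java.io.IOException;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Scanner;
import java.util.regex.Pattern;

// Hypothetical helper class, not part of the JDK.
final class ScannerSearch {

    // Returns true if the literal text `needle` occurs anywhere in the file.
    // Assumption: the file is encoded as UTF-8.
    static boolean contains(Path file, String needle) throws IOException {
        // Pattern.quote makes the needle match as literal text, not as a regex.
        Pattern pattern = Pattern.compile(Pattern.quote(needle));
        try (Scanner scanner = new Scanner(file, "UTF-8")) {
            // Horizon 0 means "no bound": search up to the end of the input.
            return scanner.findWithinHorizon(pattern, 0) != null;
        }
    }

    public static void main(String[] args) throws IOException {
        // Usage: java ScannerSearch.java <file> <text>
        System.out.println(contains(Paths.get(args[0]), args[1]));
    }
}
```

With an unbounded horizon the Scanner may buffer a large part of the input while it looks for a match, so this is probably better suited to small or medium files than to very large ones.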

reesjones
ŁukaszBachman
2

Untried, but probably the fastest mechanism is to first take your search key and encode it the same way the file is encoded.

For example, if you know the file is UTF-8, take your key and encode it from a String (which is UTF-16) into a UTF-8 byte array. This is important because by encoding down to the file's representation, you only encode the key. Using the standard Java Readers goes the other way: they convert the whole file to UTF-16.
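
As a small illustration of the encoding step, `String.getBytes(Charset)` does the conversion. The class name `KeyEncoding` and the method `encodeKey` are invented for this sketch, and it assumes you already know the file's charset.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical helper class, not part of the JDK.
final class KeyEncoding {

    // Encodes the search key into the byte representation used by the file.
    static byte[] encodeKey(String key, Charset fileCharset) {
        return key.getBytes(fileCharset);
    }

    public static void main(String[] args) {
        // "na\u00EFve" has 5 characters; the i-diaeresis takes 2 bytes in UTF-8.
        byte[] key = encodeKey("na\u00EFve", StandardCharsets.UTF_8);
        System.out.println(key.length); // prints 6
    }
}
```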

Now that you have a proper key, in bytes, use NIO to create a MappedByteBuffer for the file. This maps the file into the virtual memory space.
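
A minimal sketch of the mapping step, using `FileChannel.map`. The class name `FileMapping` and the method `mapReadOnly` are made up here. Note that a single mapping cannot exceed `Integer.MAX_VALUE` bytes, so a larger file would have to be mapped in several regions.

```java
import java.io.IOException;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

// Hypothetical helper class, not part of the JDK.
final class FileMapping {

    // Maps the whole file read-only.
    // Assumption: the file is no larger than Integer.MAX_VALUE bytes.
    static MappedByteBuffer mapReadOnly(Path file) throws IOException {
        try (FileChannel channel = FileChannel.open(file, StandardOpenOption.READ)) {
            // The mapping stays valid after the channel is closed.
            return channel.map(FileChannel.MapMode.READ_ONLY, 0, channel.size());
        }
    }
}
```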

Finally, implement a Boyer-Moore algorithm for string search, using the bytes of the key against the bytes of the file via the mapped region.
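
For the search step, here is a rough sketch. To keep it short it uses the Boyer-Moore-Horspool simplification (bad-character shifts only) rather than full Boyer-Moore. The class name `HorspoolSearch` and the method `indexOf` are invented for illustration. It takes any `ByteBuffer`, so a `MappedByteBuffer` works as the haystack.

```java
import java.nio.ByteBuffer;
import java.util.Arrays;

// Hypothetical helper class: Boyer-Moore-Horspool over a ByteBuffer.
final class HorspoolSearch {

    // Returns the index of the first occurrence of key in haystack
    // (searching indices 0 .. limit-1), or -1 if there is none.
    static int indexOf(ByteBuffer haystack, byte[] key) {
        int n = haystack.limit();
        int m = key.length;
        if (m == 0) return 0;
        if (m > n) return -1;

        // Shift table: how far the window may move, given the byte
        // currently aligned with the last position of the key.
        int[] shift = new int[256];
        Arrays.fill(shift, m);
        for (int i = 0; i < m - 1; i++) {
            shift[key[i] & 0xFF] = m - 1 - i;
        }

        int pos = 0;
        while (pos <= n - m) {
            int j = m - 1;
            // Compare right to left; absolute gets leave the buffer position untouched.
            while (j >= 0 && haystack.get(pos + j) == key[j]) {
                j--;
            }
            if (j < 0) return pos; // every byte of the key matched
            pos += shift[haystack.get(pos + m - 1) & 0xFF];
        }
        return -1;
    }
}
```

The file contains the key exactly when `indexOf` returns a value other than -1. Longer keys generally allow longer shifts, which is where this family of algorithms gains over a naive scan.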

There may well be a faster way, but this solves the bulk of the problems with searching a text file in Java. It leverages virtual memory to avoid copying large chunks of the file, and it skips the step of converting the file from whatever encoding it uses to UTF-16, which Java uses internally.

Will Hartung
0

The best implementation I've found is in MIMEParser: https://github.com/samskivert/ikvm-openjdk/blob/master/build/linux-amd64/impsrc/com/sun/xml/internal/org/jvnet/mimepull/MIMEParser.java

/**
 * Finds the boundary in the given buffer using Boyer-Moore algo.
 * Copied from java.util.regex.Pattern.java
 *
 * @param mybuf boundary to be searched in this mybuf
 * @param off start index in mybuf
 * @param len number of bytes in mybuf
 *
 * @return -1 if there is no match or index where the match starts
 */

private int match(byte[] mybuf, int off, int len) {

You will also need:

private void compileBoundaryPattern();
Grigory Kislin
0

Check out the following algorithms:

or if you want to find one of a set of strings:

Stephen C