0

I have a file of a few hundred megabytes containing strings:

str1 x1 x2\n
str2 xx1 xx2\n
str3 xxx1 xxx2\n
str4 xxxx1 xxxx2\n
str5 xxxxx1 xxxxx2

where x1 and x2 are some numbers. How big the numbers x(...x)1 and x(...x)2 are is unknown.

Each line ends with "\n". I have a list of strings, for example str2 and str4.

I want to find the corresponding numbers for those strings.

What I'm doing is pretty straightforward (and, probably, not efficient performance-wise):

source_str = read_from_file() # source_str contains all file content of a few hundred Megabyte
str_to_find = [str2, str4]
res = []
str_to_find.each do |x|
  index = source_str.index(x)
  if index
    a = source_str[index, x.length] # a contains "str2"

    #?? how do I "select" xx1 and xx2 ??


    # and finally...
    # res << num1
    # res << num2
  end
end

Note that I can't use source_str.split("\n") because it raises ArgumentError: invalid byte sequence in UTF-8, and I can't fix that by changing the file in any way. The file can't be changed.

the Tin Man
Mario Honse
  • What is `read_from_file()`? You are slurping the entire file into memory at once? That is hardly scalable. Instead, consider using `foreach` and iterating over the file line-by-line. It's just as fast, and a lot more scalable. We need to have better input samples. Give us reasonable examples for `str2` and `str4`. What OS are you on? – the Tin Man Nov 11 '14 at 03:42
  • read_from_file() - a method that returns a whole content of a file, it's been said. – Mario Honse Nov 11 '14 at 04:25
  • "[Why is slurping a file bad?](http://stackoverflow.com/q/25189262/128421)" explains why you don't want to read the entire file into memory. – the Tin Man Nov 11 '14 at 16:13
  • If you found at least one of the answers helpful, don't forget to select one. – Cary Swoveland Nov 14 '14 at 17:24

2 Answers

3

You want to avoid reading hundreds of megabytes into memory, as well as scanning them repeatedly. Doing so can take a very long time and exhaust the machine's available memory.

Try to reframe the problem so that you can treat the large input file as a stream. Instead of asking, for each string you want to find, "does it exist in my file?", ask, for each line in the file, "does it contain a string I am looking for?".

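str_to_find = [str2, str4]
numbers = []
File.foreach('foo.txt') do |li|
  # The first column is the key; the remaining columns are its numbers.
  key, *values = li.split
  # concat appends both numbers; Array += String would raise a TypeError.
  numbers.concat(values) if str_to_find.include?(key)
end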

Also, read @theTinMan's answer again regarding the file encoding. What he is suggesting is that you may be able to fine-tune how the file is read to avoid the error, without changing the file itself.

If you have a very large number of items in str_to_find, I'd suggest that you use a Set instead of an Array for better performance:

require 'set'
str_to_find = [str1, str2, ... str5000].to_set
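
A minimal sketch of the streaming approach combined with a Set, assuming whitespace-separated columns with the key in the first one. The sample data, the `wanted` and `results` names, and the use of a Hash to keep each key's numbers together are illustrative assumptions, not part of the original answer:

```ruby
require 'set'
require 'tempfile'

# Hypothetical sample data standing in for the real multi-hundred-megabyte file.
sample = Tempfile.new('sample')
sample.write("str1 1 2\nstr2 11 22\nstr3 111 222\nstr4 1111 2222\nstr5 11111 22222")
sample.close

# Set#include? is a hash lookup, so the cost per line does not grow
# with the number of keys being searched for.
wanted = %w[str2 str4].to_set

results = {}
File.foreach(sample.path) do |line|
  # First column is the key, the remaining columns are its numbers.
  key, *values = line.split
  results[key] = values if wanted.include?(key)
end

p results
```

The file is read once, one line at a time, so memory use stays roughly constant no matter how large the file or the key list is. The numbers are kept as strings because their size is unknown.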
Uri Agassi
  • `so instead of asking for each string you want to find "does it exist in my file?", try asking for each line in the file "does it contain a string I am looking for?".` -- does that make any difference? a times b is the same b times a, isn't it? – Mario Honse Nov 11 '14 at 07:03
  • the size of str_to_find is quite big as well. – Mario Honse Nov 11 '14 at 07:04
  • @MarioHonse - sure it makes a difference - you don't need to have the whole 100MB of text in memory... Also, finding a substring inside a large string is much harder than matching two strings to see if they are the same (not to mention less buggy - `str1` exists in `str12`, although it might not be what you are looking for). How large is `str_to_find`? does it have thousands of entries? – Uri Agassi Nov 11 '14 at 07:22
  • `str_to_find` can contain approximately 50 thousand entries. – Mario Honse Nov 11 '14 at 09:11
  • in that case, you might want to use a `Set` instead of an array - arriving to the answer of whether a line is relevant or not in `O(1)` instead of `O(n)`. I'll update the answer – Uri Agassi Nov 11 '14 at 09:19
  • If you are trying to match 50K entries, then you should consider a different approach: using a database, building a schema that lets you split your columns into fields, and letting the database engine do the heavy lifting. That's what they're made for. – the Tin Man Nov 11 '14 at 16:10
  • Why Set, not a hash table? I need to do a search in it, so using a HashTable is better, isn't it? – Mario Honse Nov 12 '14 at 04:01
  • The difference between a Set and a Hash, is that the Set has just the keys without the values. In this use case, there is no use for values, so Set is more appropriate than Hash – Uri Agassi Nov 12 '14 at 04:54
2

If you want to find a line in a text file, which it sounds like you are reading, then read the file line-by-line.

The IO class has the foreach method, which makes it easy to read a file line-by-line, which also makes it possible to easily locate lines that contain the particular string you want to find.

If you had your source input file saved as "foo.txt", you could read it using something like:

str2 = 'some value'
str4 = 'some other value'
numbers = []
File.foreach('foo.txt') do |li|
  numbers << li.split[2] if li[str2] || li[str4]
end

At the end of the loop numbers should contain the numbers you want.

You say you're getting an encoding error, but you don't give us any clue which characters are causing it. Without that information we can't really help you fix the problem, except to say that you need to tell Ruby what the file's encoding is. You can do that when the file is opened by setting the open_args to the correct encoding. Odds are good that it is ISO-8859-1 or Windows-1252, since those are very common on Windows machines.
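
As a rough sketch of what that could look like, the example below passes an `encoding:` option to `File.foreach`. ISO-8859-1 is only a guess at the real encoding, and the sample file and the `café` key are made up for illustration. The file is opened read-only and is never modified:

```ruby
require 'tempfile'

# Hypothetical sample: the byte 0xE9 is "é" in ISO-8859-1
# but an invalid byte sequence in UTF-8.
sample = Tempfile.new('sample')
sample.binmode
sample.write("str1 1 2\ncaf\xE9 11 22\nstr3 111 222\n".b)
sample.close

numbers = []
# "ISO-8859-1:UTF-8" means: the bytes on disk are ISO-8859-1 (an assumption here),
# and Ruby should transcode each line to UTF-8 while reading.
File.foreach(sample.path, encoding: 'ISO-8859-1:UTF-8') do |line|
  key, *values = line.split
  numbers.concat(values) if key == "caf\u00E9"
end

p numbers
```

Without the `encoding:` option, Ruby would treat the line as UTF-8 and `split` would raise the same `ArgumentError: invalid byte sequence in UTF-8` reported in the question.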


"I have to find a list of values, iterating through each line doesn't seem sensible because I'd have to iterate for each value over and over again."

We can only work with the examples you give us. Since that wasn't clearly explained in your question you got an answer based on what was initially said.

Ruby's Regexp has the tools necessary to make this work, but to do it correctly requires taking advantage of Perl's Regexp::Assemble library, since Ruby has nothing close to it. See "Is there an efficient way to perform hundreds of text substitutions in ruby?" for more information.
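
For a small list of strings, a plain alternation built with `Regexp.union` illustrates the idea. This is a swapped-in simplification, not the optimized pattern that Regexp::Assemble would generate, and it may be slow with tens of thousands of alternatives. The sample string and the variable names are assumptions for illustration:

```ruby
# Hypothetical in-memory content and search keys.
source_str = "str1 1 2\nstr2 11 22\nstr3 111 222\nstr4 1111 2222\nstr5 11111 22222"
str_to_find = %w[str2 str4]

# Regexp.union escapes each key and joins them with "|".
# ^ anchors at the start of a line, and the space after the group
# keeps "str2" from matching a longer key such as "str22".
pattern = /^(#{Regexp.union(str_to_find)}) (\S+) (\S+)/

results = {}
source_str.scan(pattern) do |key, num1, num2|
  results[key] = [num1, num2]
end

p results
```

This scans the string once for all keys, instead of once per key as `index` would.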

Note that this will allow you to scan through a huge string in memory; however, that is still not a good way to process what you are talking about. I'd use a database instead, since databases are designed for this sort of task.

Community
the Tin Man
  • `If you want to find a line in a text file, which it sounds like you are reading, then read the file line-by-line.` - why is it better? I have to find a **list** of values, iterating through each line doesn't seem sensible because I'd have to iterate for **each** value over and over again. – Mario Honse Nov 11 '14 at 04:27
  • `You say you're getting an encoding error, but you don't give us any clue what the characters are that are causing it.` - you should read more carefully what I've written. You don't need the clue because having the clue meaning you're going to change a source file somehow which is in my case impossible, it can't be touched because it may change some text so that I won't be able to find it. – Mario Honse Nov 11 '14 at 04:31
  • Thanks for the method "split", I didn't know it may not take any arguments. – Mario Honse Nov 11 '14 at 04:33
  • And the last thing `numbers << li.split[2] if li[str2] || li[str2]` but I have A LOT of strings like str2, NOT only 2. In my example there're only 2 strings, but really I have a lot of them so I can't use `||`. – Mario Honse Nov 11 '14 at 04:34
  • **you should read more carefully what I've written. You don't need the clue because having the clue meaning you're going to change a source file somehow which is in my case impossible,** I guess your file must be a quantum file. How did you make it? – 7stud Nov 11 '14 at 07:23
  • @MarioHonse, we can only work with the information you give us. You said "I have a list of strings str2 and str4." That is two. If you give us the right parameters, we can give you better answers, but it is always a case of GIGO because we can not read your mind. It really seems like you have decided your way will be the best because you're arguing with us. It could be that we have tested, run benchmarks, etc., and advice given is based on those. – the Tin Man Nov 11 '14 at 16:07
  • @MarioHonse, "you're going to change a source file somehow which is in my case impossible, it can't be touched because it may change some text so that I won't be able to find it." Really? Nothing suggested would possibly change the file since it would be opened in read-only mode. Knowing the encoding allows Ruby to read the file and convert the incoming binary bytes into the expected UTF-8, which then will avoid the error you are seeing. If you are concerned about Ruby internally using UTF-8, then you need to spend some time learning about how Ruby manages character encodings. – the Tin Man Nov 11 '14 at 17:00