Reason for the Python script getting really slow if rewritten in Ruby?

Question

I've been practicing machine learning on the task of restoring spaces in a joined text. Since I decided to use the dictionary feature, I searched the web for some ideas of splitting the text based on the dictionary, and I stumbled upon this idea. Based on it, I've written a script that converts the text without spaces to a vertical form needed by the ML tool:

#!/usr/bin/python
# -*- coding: UTF-8 -*-
from math import log
import string
import fileinput

words = open("dictionary.txt", encoding="utf8").read().split()
wordcost = dict((k, log((i+1)*log(len(words)))) for i,k in enumerate(words))
maxword = max(len(x) for x in words)

def infer_spaces(s):
    # Find the best match for the i first characters, assuming cost has
    # been built for the i-1 first characters.
    # Returns a pair (match_cost, match_length).
    def best_match(i):
        candidates = enumerate(reversed(cost[max(0, i-maxword):i]))
        return min((c + wordcost.get(s[i-k-1:i], 9e999), k+1) for k,c in candidates)

    # Build the cost array.
    cost = [0]
    for i in range(1,len(s)+1):
        #original script was pretty basic, so symbols\words not in dictionary
        #broke the processing completely. This fixed the problem.
        if s[i-1] not in wordcost:
            wordcost[s[i-1]] = log((len(words) + 1)*log(len(words)))
        c,k = best_match(i)
        cost.append(c)
        print(cost)

    # Backtrack to recover the minimal-cost string.
    out = []
    i = len(s)
    while i>0:
        c,k = best_match(i)
        assert c == cost[i]
        out.append(s[i-k:i])
        i -= k

    return " ".join(reversed(out))

def char_type(s):
    """ Character type function """
    if s in string.punctuation:
        return "P"
    elif s in string.digits:
        return "D"
    elif s in string.ascii_letters:
        return "F"
    elif s.isupper():
        return "U"
    else:
        return "R"


def test_to_vert(s):
    """
   Transforms regular text into a vertical form.
   """
    s = s.rstrip('\n')
    orig_sent = s
    a = s.lower().replace("ё", "е")
    a = infer_spaces(a)
    space_indices = []
    a = list(a)
    for i,k in enumerate(a):
        if k == " ":
            space_indices.append(i)

    orig_sent = list(orig_sent)
    for i in space_indices:
        orig_sent.insert(i, " ")

    orig_sent = "".join(orig_sent)
    orig_sent = orig_sent.split(" ")

    answer = []

    for word in orig_sent:
        i = 0
        for letter in word:
            answer.append(letter + "\t" + letter.lower() + "\t" + \
                  char_type(letter) + "\t" + str(i) + "|" + str(len(word)))
            i += 1
    return '\n'.join(answer)

testfile = open("head.txt", encoding="utf8")
output = open("test_python.txt", 'w', newline="\n", encoding="utf8")



for line in testfile:
    if line in ['\n', '\r\n']:
        output.write('\n')
    else:
        output.write(test_to_vert(line))
        output.write('\n\n')

output.write('\n\n\n')
testfile.close()
output.close()

So far so good, it works. After that I decided to practice my Ruby (I'm relatively new to coding), so I tried to re-write the script (Ruby version):

#!/usr/bin/ruby
#encoding: UTF-8
Encoding::default_internal = "UTF-8"
Encoding::default_external = "UTF-8"

require 'active_support/core_ext'

@wordcost = Hash.new
@count = %x{wc -l dictionary.txt}.split.first.to_i

i = 0

File.readlines("dictionary.txt").each do |line|
  line.chomp!

  @wordcost[line.mb_chars.downcase.to_s] ||= Math.log((i+1) * Math.log(@count))
  i += 1
end

def infer_spaces(s)

  @sent = s.chomp

  def best_match(i)
    result = []
    candidates = @cost[0, i].reverse
    candidates.each_index do |index|
      if @wordcost.has_key?(@sent[i-index-1...i].mb_chars.downcase.to_s)
        result << [(candidates[index] + @wordcost[@sent[i-index-1...i].mb_chars.downcase.to_s]), (index + 1)]
      else
        result << [(candidates[index] + Float::INFINITY), (index + 1)]
      end
    end
    result.sort!
    return result[0][0], result[0][1]
  end

  @cost = [0]
  for i in (1..@sent.length)
    @wordcost[@sent[i-1].mb_chars.downcase.to_s] ||= Math.log(@count * Math.log(@count))
    c, k = best_match(i)
    @cost << c
  end

  out = []
  i = @sent.length
  while i>0
    c, k = best_match(i)
    if c != @cost[i]
      raise "Something went wrong"
    end
    out << @sent[i-k...i]
    i -= k
  end

  return out.reverse.join(" ")

end

def char_type(string)
  case string
  when /[[:punct:]]/
    return "P"
  when /[[:digit:]]/
    return "D"
  when /[A-z]/
    return "F"
  when /[[:upper:]]/
    return "U"
  else
    return "R"
  end
end

def test_to_vert(s)
  s.chomp!
  orig_sent = s
  a = s.mb_chars.downcase.to_s
  a = infer_spaces(a)
  space_indices = []
  a = a.split("")
  a.each_index do |i|
    if a[i] == " "
      space_indices << i
    end
  end
  orig_sent = orig_sent.split("")
  space_indices.each do |x|
    orig_sent.insert(x, " ")
  end
  orig_sent = orig_sent.join
  orig_sent = orig_sent.split

  answer = []

  orig_sent.each do |word|
    letters = word.split("")
    letters.each_index do |i|
      answer << letters[i] + "\t" + letters[i].mb_chars.downcase.to_s + \
      "\t" + char_type(letters[i]) + "\t" + i.to_s + "|" + word.length.to_s
    end
  end

  return answer.join("\n")
end

file = File.open('test_ruby_vert.txt', 'w')

File.readlines("test.txt").each do |line|
  if line.chomp.empty?
    file.write("\n")
  else
    file.write(test_to_vert(line))
    file.write("\n\n")
  end
end

file.close

The rewritten script works, but it is far slower than the Python version: Python processes a ~40000-line text in no more than an hour, whereas the Ruby script has been running for hours and has only processed about 15% of the text.

I wonder what could slow it down so much. Is it because I need "active_support/core_ext" to downcase Cyrillic text in Ruby? Is it because I don't limit the processing in best_match using maxword? Or did some other part of the rewrite go wrong? Any insight would be really helpful.

It would be very difficult to say for sure without profiling the Ruby code, which is a lot of work. Even just looking through 100 lines of linked code is quite an ask. In general, though, Ruby has some efficient String and Array processing but a high overhead cost for manipulating individual objects, and it is easy to miss places where it duplicates strings or generates millions of unnecessary objects. If you have time, I'd recommend using Ruby's profiler to gain some insight yourself. — Neil Slater, Mar 20 '14 at 08:28
@NeilSlater Thanks. I never tried to use it before, but I guess this is the time. I'm really interested in what did I do so wrong that the script is so slow compared to the Python version. — Vilmar, Mar 20 '14 at 08:32
A simple mechanical conversion of the code would not be "wrong", but may expose language differences where Ruby does something much slower than Python. Ruby is not known for being inherently fast. Converting C to Ruby mechanically can result in code that is slower by a factor of 100. Sometimes this can be resolved by restructuring to use Ruby core methods that are more efficient (but that don't exist in the language converting from) — Neil Slater, Mar 20 '14 at 08:36
Another option for improving the performance of a Ruby program is to use a faster Ruby implementation, such as [Rubinius](http://rubini.us/). — , Mar 20 '14 at 08:54
Also, since this is a question about performance, maybe it's a better fit for the [Code Review Stack Exchange](http://codereview.stackexchange.com/help/on-topic)? — , Mar 20 '14 at 08:58
@Cupcake True, my question may be more appropriate on Code Review. If you can migrate the question here, I would be grateful. — Vilmar, Mar 20 '14 at 09:04
I can't run this. What does "wc -l dictionary.txt" do? Try adding a counter to the methods you run many times, take the method with the largest count and benchmark it, then try to make it faster. Also try to limit the number of times this method is run. — peter, Mar 20 '14 at 12:05
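
The counter-and-benchmark suggestion in the last comment could look roughly like the sketch below. It uses Ruby's standard `benchmark` library; `pick_cheapest`, `CALLS`, and `CANDIDATES` are hypothetical stand-ins invented for illustration, not names from the script in the question.

```ruby
require 'benchmark'

# Hypothetical call counter, keyed by method name.
CALLS = Hash.new(0)

# Hypothetical stand-in for a hot method. It sorts the candidates the way
# the Ruby best_match does, so the timing has something to measure.
def pick_cheapest(candidates)
  CALLS[:pick_cheapest] += 1
  candidates.sort.first
end

# Made-up [cost, length] pairs, deterministic so runs are comparable.
CANDIDATES = Array.new(500) { |k| [(k * 37) % 101, k + 1] }

elapsed = Benchmark.realtime do
  1_000.times { pick_cheapest(CANDIDATES) }
end

puts "pick_cheapest: #{CALLS[:pick_cheapest]} calls, #{elapsed.round(4)} s"
```

A method that shows both a high call count and a large share of the elapsed time is the first candidate for closer inspection.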

score 3 · Accepted Answer · answered Mar 20 '14 at 12:56

I didn't take a close look (there's just way too much code in your question for a detailed examination; you really need to whittle it down to an SSCCE), but a few things jumped out at me.

The most important one is that Language Implementations are designed to make idiomatic, well-factored, well-designed code run fast. Your code, however, looks more like Fortran than Ruby, it is definitely neither idiomatic Ruby nor well-factored.

Some smaller observations:

Here you are needlessly creating lots of string objects:

answer << letters[i] + "\t" + letters[i].mb_chars.downcase.to_s + \
  "\t" + char_type(letters[i]) + "\t" + i.to_s + "|" + word.length.to_s

You should prefer mutating a single string using << over creating many temporary strings using +:

answer << ('' << letters[i] << "\t" << letters[i].mb_chars.downcase.to_s <<
  "\t" << char_type(letters[i]) << "\t" << i.to_s << "|" << word.length.to_s)

But really, string interpolation is much more idiomatic (and incidentally much faster):

answer << "#{letters[i]}\t#{letters[i].mb_chars.downcase}\t#{char_type(letters[i])}\t#{i}|#{word.length}"
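
To check the relative cost of the three styles on your own machine, a small comparison along these lines may help. It is only a sketch: the `row_*` helper names and the iteration count `N` are made up for illustration, and it assumes Ruby 2.4 or later, where the built-in `String#downcase` handles Cyrillic, so `mb_chars` is left out.

```ruby
require 'benchmark'

# Builds the row with +, allocating a new string at every step.
def row_plus(letter, type, i, len)
  letter + "\t" + letter.downcase + "\t" + type + "\t" + i.to_s + '|' + len.to_s
end

# Builds the row by appending to one mutable string with <<.
def row_append(letter, type, i, len)
  ''.dup << letter << "\t" << letter.downcase << "\t" << type << "\t" << i.to_s << '|' << len.to_s
end

# Builds the row with string interpolation.
def row_interp(letter, type, i, len)
  "#{letter}\t#{letter.downcase}\t#{type}\t#{i}|#{len}"
end

N = 50_000 # arbitrary iteration count, chosen only to keep the run short

Benchmark.bm(14) do |bm|
  bm.report('plus')          { N.times { row_plus('Ж', 'U', 0, 6) } }
  bm.report('append')        { N.times { row_append('Ж', 'U', 0, 6) } }
  bm.report('interpolation') { N.times { row_interp('Ж', 'U', 0, 6) } }
end
```

All three helpers return the same string, so any difference in the report comes from how the string is assembled.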

You have a lot of unnecessary returns in your code. Again, that is non-idiomatic, and also slower. For example here:

def char_type(string)
  case string
  when /[[:punct:]]/
    return "P"
  when /[[:digit:]]/
    return "D"
  when /[A-z]/
    return "F"
  when /[[:upper:]]/
    return "U"
  else
    return "R"
  end
end

This should be written just

def char_type(string)
  case string
  when /[[:punct:]]/
    "P"
  when /[[:digit:]]/
    "D"
  when /[A-z]/
    "F"
  when /[[:upper:]]/
    "U"
  else
    "R"
  end
end

There are other places with unnecessary returns as well.

Within your infer_spaces method you define another global method named best_match. Since infer_spaces is called by test_to_vert, which is called inside your readlines loop, the method will be defined over and over and over again for every line in the file, which means that (since most Ruby implementations nowadays are compiled), it will have to be compiled over and over and over and over again. Each redefinition will also invalidate all previous optimizations such as speculative inlining. Just move the method definition outside of the loop.
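
The repeated definition can be observed directly with the `method_added` hook, which Ruby calls every time an instance method is defined or redefined on a class. The sketch below is illustrative only: `DefinitionCounter`, `NestedSplitter`, and `FlatSplitter` are hypothetical names, and the method bodies are trivial placeholders rather than the real algorithm.

```ruby
# Counts how often each instance method is (re)defined on a class.
module DefinitionCounter
  def definitions
    @definitions ||= Hash.new(0)
  end

  def method_added(name)
    definitions[name] += 1
  end
end

# Mirrors the structure in the question: a def nested inside a method.
class NestedSplitter
  extend DefinitionCounter

  def infer_spaces(s)
    # This def is executed again on every call to infer_spaces.
    def best_match(i)
      i
    end
    best_match(s.length)
  end
end

# Same behavior, but best_match is defined once, next to infer_spaces.
class FlatSplitter
  extend DefinitionCounter

  def best_match(i)
    i
  end

  def infer_spaces(s)
    best_match(s.length)
  end
end

3.times { NestedSplitter.new.infer_spaces('abc') }
3.times { FlatSplitter.new.infer_spaces('abc') }

puts "nested: best_match defined #{NestedSplitter.definitions[:best_match]} times"
puts "flat:   best_match defined #{FlatSplitter.definitions[:best_match]} times"
```

With three calls each, the nested variant should report three definitions of `best_match` and the flat variant one.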

IO::readlines reads the entire file into memory as an array. Then you iterate over the array. You might just as well iterate over the lines of the file directly, using IO::foreach instead:

File.foreach("test.txt") do |line|

This will avoid loading the entire file into memory at once.
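
A fuller version of the main loop built on `File.foreach` might look like the sketch below. `convert_file` is a hypothetical helper, and the block passed to it stands in for `test_to_vert`; here it merely upcases the line so that the example runs on its own with temporary files.

```ruby
require 'tmpdir'

# Streams in_path line by line and writes the converted text to out_path.
# The block receives each non-blank line (without its newline) and returns
# the text to write; in the real script it would call test_to_vert.
def convert_file(in_path, out_path)
  File.open(out_path, 'w') do |out|
    File.foreach(in_path) do |line|
      if line.chomp.empty?
        out.write("\n")
      else
        out.write(yield(line.chomp))
        out.write("\n\n")
      end
    end
  end
end

# Demonstration on a throwaway directory with a tiny made-up input file.
Dir.mktmpdir do |dir|
  input = File.join(dir, 'test.txt')
  output = File.join(dir, 'test_ruby_vert.txt')
  File.write(input, "ab\n\ncd\n")
  convert_file(input, output) { |text| text.upcase }
  print File.read(output)
end
```

Using the block form of `File.open` also closes the output file automatically, even if the conversion raises.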

You didn't say which Ruby Implementation you are using. Since you have a fairly hot and tight loop, using an implementation with some sort of hotspot optimizations, polymorphic inline caching, speculative inlining, adaptive optimizations and so on, might make a big difference, especially if you fix the recompilation problem for best_match. Rubinius and JRuby are good candidates here. Rubinius, for example, has been demonstrated to be faster than hand-optimized C in certain cases!

Note: these are all just micro-optimizations. I didn't actually take a look at your algorithm. You can probably get much more performance by tweaking the algorithm rather than micro-optimizing the implementation.

For example: in the Python implementation of best_match, you use min to find the minimum element, which is O(n), whereas in Ruby, you sort and then return the first element, which is O(n * log n).
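
A closer port of the Python `best_match` might look like the following sketch. It takes the minimum directly instead of sorting. It also restores the `maxword` window: the Python version looks only at `cost[max(0, i-maxword):i]`, while the Ruby version in the question scans `@cost[0, i]`, so its work per position grows with the length of the line. Passing the state in as arguments and the tiny `WORDCOST` table are assumptions made for illustration.

```ruby
# Returns [cost, length] for the cheapest last word ending at position i.
# Only the last `maxword` entries of `cost` are examined, as in the Python
# version, and the minimum is found in one pass rather than by sorting.
def best_match(sent, cost, wordcost, maxword, i)
  start = [0, i - maxword].max
  cost[start...i].reverse.each_with_index.map { |c, k|
    [c + wordcost.fetch(sent[i - k - 1...i], Float::INFINITY), k + 1]
  }.min
end

# Made-up costs, for illustration only.
WORDCOST = { 'a' => 1.0, 'b' => 1.0, 'ab' => 1.5 }.freeze

# cost[0] is the empty prefix; cost[1] is the cost of "a".
p best_match('ab', [0, 1.0], WORDCOST, 2, 2) # window covers "b" and "ab"
p best_match('ab', [0, 1.0], WORDCOST, 1, 2) # window covers only "b"
```

In the real script `maxword` would be the length of the longest dictionary entry, computed once while loading the dictionary.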

This is a lot of useful information for me here, Jörg. Thank you for your time! — Vilmar, Mar 20 '14 at 13:18

score 0 · Answer 2 · answered Mar 20 '14 at 08:59


I don't know if this will help you, but a lot of the math packages in Python are implemented in C and are therefore quite fast.

From http://docs.python.org/2/library/math.html : "The math module consists mostly of thin wrappers around the platform C math library functions"

Maybe the use of the logarithmic function from math is the reason the Python script is so much faster?

heidivikki


The only math used in my script is "log". I don't suppose it leads to such a huge increase in processing? – Vilmar Mar 20 '14 at 09:02

The math library in Ruby is also written in C: http://ruby-doc.org/core-2.1.1/Math.html#method-c-log – Patrick Oscity Mar 20 '14 at 09:14
Ah, sorry, I didn't know that. Should I remove the answer? I'm new here. – heidivikki Mar 20 '14 at 09:44
@heidivikki: That is up to you. Deleting an answer means you do not get credit (or demerit) for upvotes and downvotes on it. If you don't wish to stand by the answer or deal with comments on it then deleting it may be best. However, a "wrong" answer does provide some good on Stack Overflow, because someone who initially agrees with your answer will read the comments and may understand, like you, why it is not correct in this case. – Neil Slater Mar 20 '14 at 11:14
