3

I have a CSV-type Unicode text file that actually consists of two float values (x and y), an int (i), and two strings (s and t) per line. The lines are sorted by the first float value but I do not make use of that yet.

The task is to extract all lines that have the float values in a specified range (xmin < x < xmax and ymin < y < ymax). Think of a geometrical search for surrounding items.

Since the file is rather large (1.3 million lines, 67 MBytes), I do not want to read it into memory all at once. Instead I used the StreamReader class provided here.

For performance testing, I implemented several different functions:

  1. read the file line by line, ignoring the contents completely (just counting lines)
  2. same, but actually filter the matching lines (A)
  3. same, but filter (B)
  4. read the file line by line and store each line as an array element (Swift struct / Python tuple)
  5. using the array, filter by float comparison

Filtering (A) is done by string comparison instead of converting the string values to float numbers first. Filtering (B) does float conversion and float comparison. Conversion/comparison is done as necessary, i.e. first only x is processed; y is only processed if x already matches.

When testing my functions on an iMac, I get the following results (approx.):

Default debug configuration (no optimization):

  1. 450k lines/s
  2. 120k lines/s
  3. 45k lines/s
  4. 18k lines/s; the array built up needs 300 MBytes
  5. 3480k entries/s

Default release configuration (optimized -Os):

  1. 780k lines/s (factor: 1.7 to Swift_Debug)
  2. 130k lines/s (factor: 1.1)
  3. 68k lines/s (factor: 1.5)
  4. 20k lines/s (factor: 1.1)
  5. 208000k entries/s (!) (factor: 59.8)

For comparison, doing the same in Python yields:

  1. 8140k lines/s (factor: 10 to Swift_Release)
  2. 4430k lines/s (factor: 34)
  3. 2010k lines/s (factor: 30)
  4. 710k lines/s (factor: 36); the array needs 330 MBytes
  5. 9980k entries/s (factor: 0.05!)

The source text can be found here: Swift, Python

Performance limit would be about 2000k lines/s (i.e. the extraction should be done in < 1 second).

It's hard to believe that compiled Swift programs are so much slower than interpreted Python scripts (actually, byte code), the significant exception being the locally computed run.

What can I do to improve the poor performance of the Swift program? Has anyone gained experience with this kind of problem?

What could generally be done to make the functionality more efficient?

  • Maybe the StreamReader implementation is not as efficient as it could be?
  • I could think of using fixed-length numerical fields to avoid scanning for the field separators
  • Or storing the numeric values in a compacted form in the textfile.
  • By ending the scan when the x value surpasses the matching interval, average times can be halved.
  • Searching for the interval by seeking into the file and using bisection algorithm could improve efficiency.
  • Or indexing the file by x values for navigation.
  • Or even bite the bullet and hold the data structure in memory after all — but this may become less practical when shifting to iOS.
  • Dependent on the context, parting the file in several subfiles could be appropriate, too.
  • What about using SQLite?
Community
  • 1
  • 1
Stefan
  • 1,036
  • 10
  • 32
  • 1
    `When testing my functions on an iMac` How? What's your process? Compiled in debug mode or release mode? With or without compiler optimizations? Which Swift methods were used? Without all these infos, a benchmark has no meaning. – Eric Aya Feb 29 '16 at 12:16
  • i would try to solve it using SQLite or better MySQL - because it's a classical task for a RDBMS systems - they are designed to solve such tasks. – MaxU - stand with Ukraine Feb 29 '16 at 12:22
  • @EricD.: Xcode 7.2.1, OS X command line tool project template, Debug configuration, straightforward implementation. I would not call my tests "benchmarks" for obvious reasons. But switching from Debug to Release, with optimizations enabled, does actually improve results in a significant way: factors are 1.8 / 1.4 / 1.7 / 1.1 / 55.5(!). There is no big difference between default optimization (Fastest, Smallest) and maximum optimization. – Stefan Feb 29 '16 at 12:41
  • `Debug configuration` Here's the cause for slowness. Actually: it's probably *one* of the multiple possible causes. Indeed, not a benchmark, so... what is the question then? Given that you actually don't *know* that it would be slower with Swift compared to Python... :) – Eric Aya Feb 29 '16 at 12:44
  • It's impossible to assess the performance of the two without seeing the code on both side. Did Python use some highly-optimized parser for CSV and Swift used something else, slower? Did you run it in Instruments to see what the bottlenecks are? How did you convert strings to number, `NSScanner`? Too many variables to give you a specific answer – Code Different Feb 29 '16 at 13:18
  • @EricD.: I'm a Swift newbie and I really like the language. Other favorite languages are Python and C#. I am developing an iOS application currently and stumbled upon the problem from which I distilled the question here. – Stefan Feb 29 '16 at 14:06
  • I'm just saying that this question is of no relevance because the speed comparison that's been made is biased, as I noted in previous comment. Now, my opinion is that with good code and correct compilation options, Swift will always be *much* faster than any interpreted language - and without proper compilation, it can indeed be similar - or even slower. – Eric Aya Feb 29 '16 at 14:11
  • Swift and Python source is now available, see links in the question text. – Stefan Feb 29 '16 at 14:42
  • @EricD.: I tend to believe the same, but that's exactly the question here: what would be "good code"? In Python, I do `len(f.readlines())` which is the simplest possible way to count lines from a file. In Swift, I either use the `StreamReader`class I linked to in my question, or I simply do `String(contentsOfFile: filename, encoding: ...).componentsSeparatedByString("\r\n").count` which, inexplicably, is slower by a factor of 10 even with optimization. Where's my error? – Stefan Feb 29 '16 at 16:21

0 Answers0