I have a CSV-like Unicode text file where each line consists of two float values (`x` and `y`), an int (`i`), and two strings (`s` and `t`). The lines are sorted by the first float value, but I do not make use of that yet.
The task is to extract all lines whose float values fall within a specified range (`xmin < x < xmax` and `ymin < y < ymax`). Think of a geometric search for surrounding items.
Since the file is rather large (1.3 million lines, 67 MBytes), I do not want to read it into memory all at once. Instead I used the `StreamReader` class provided here.
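For comparison, buffered line-by-line reading is what Python's file iterator already does internally; a minimal sketch of the same chunk-and-split idea a `StreamReader` uses (chunk size, encoding, and function name are my assumptions, not from the linked code):

```python
def read_lines(path, chunk_size=65536, encoding="utf-8"):
    """Yield lines without loading the whole file: buffer fixed-size
    chunks and split on newline, like a minimal StreamReader."""
    buf = ""
    with open(path, "r", encoding=encoding) as f:
        while True:
            chunk = f.read(chunk_size)
            if not chunk:
                break
            buf += chunk
            # Everything before the last newline is complete lines;
            # the remainder stays in the buffer for the next chunk.
            *lines, buf = buf.split("\n")
            for line in lines:
                yield line
    if buf:
        yield buf
```

In practice, simply iterating `for line in open(path)` lets the interpreter's C implementation do this buffering, which is one reason the Python version is hard to beat with a naive hand-rolled reader.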
For performance testing, I implemented several different functions:
- read the file line by line, ignoring the contents completely (just counting lines)
- same, but actually filter the matching lines (A)
- same, but filter (B)
- read the file line by line and store each line as an array element (Swift `struct` / Python tuple)
- using the array, filter by float comparison
Filtering (A) is done by string comparison instead of converting the string values to float numbers first. Filtering (B) does float conversion and float comparison. Conversion/comparison is done only as necessary, i.e. first only `x` is processed; `y` is only processed if `x` already matches.
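A minimal Python sketch of the two strategies; the comma separator, field order, and fixed-width `%09.4f` formatting are my assumptions for illustration:

```python
def filter_b(lines, xmin, xmax, ymin, ymax):
    """(B): convert to float, lazily -- y only if x already matches."""
    out = []
    for line in lines:
        fields = line.split(",")
        x = float(fields[0])
        if xmin < x < xmax:
            y = float(fields[1])
            if ymin < y < ymax:
                out.append(line)
    return out

def filter_a(lines, xmin, xmax, ymin, ymax):
    """(A): compare the decimal strings directly.
    Plain string comparison only orders correctly when the fields are
    equal-width, zero-padded, and non-negative -- the file must be
    formatted accordingly (here: 9 characters, 4 decimals)."""
    lo_x, hi_x = "%09.4f" % xmin, "%09.4f" % xmax
    lo_y, hi_y = "%09.4f" % ymin, "%09.4f" % ymax
    out = []
    for line in lines:
        fields = line.split(",")
        if lo_x < fields[0] < hi_x and lo_y < fields[1] < hi_y:
            out.append(line)
    return out
```

(A) avoids the float parse entirely, but only remains correct under the fixed-width formatting constraint noted in the docstring.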
When testing my functions on an iMac, I get the following results (approx.):
Default debug configuration (no optimization):
- 450k lines/s
- 120k lines/s
- 45k lines/s
- 18k lines/s; the built-up array needs 300 MBytes
- 3480k entries/s
Default release configuration (optimized -Os):
- 780k lines/s (factor: 1.7 to Swift_Debug)
- 130k lines/s (factor: 1.1)
- 68k lines/s (factor: 1.5)
- 20k lines/s (factor: 1.1)
- 208000k entries/s (!) (factor: 59.8)
For comparison, doing the same in Python yields:
- 8140k lines/s (factor: 10 to Swift_Release)
- 4430k lines/s (factor: 34)
- 2010k lines/s (factor: 30)
- 710k lines/s (factor: 36); the array needs 330 MBytes
- 9980k entries/s (factor: 0.05!)
The source text can be found here: Swift, Python
The performance target is about 2000k lines/s (i.e. the extraction should finish in < 1 second).
It's hard to believe that compiled Swift programs are so much slower than interpreted Python scripts (byte code, actually), the notable exception being the in-memory filtering run.
What can I do to improve the poor performance of the Swift program? Has anyone gained experience with this kind of problem?
What could generally be done to make the functionality more efficient?
- Maybe the `StreamReader` implementation is not as efficient as it could be?
- I could think of using fixed-length numerical fields to avoid scanning for the field separators.
- Or storing the numeric values in a compacted form in the text file.
- By ending the scan when the `x` value surpasses the matching interval, average times can be halved.
- Searching for the interval by seeking into the file and using a bisection algorithm could improve efficiency.
- Or indexing the file by `x` values for navigation.
- Or even biting the bullet and holding the data structure in memory after all, though this may become less practical when moving to iOS.
- Depending on the context, splitting the file into several subfiles could be appropriate, too.
- What about using SQLite?
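Since the file is already sorted by `x`, the in-memory variant can exploit bisection directly; a sketch using Python's standard `bisect` module (the tuple layout and the precomputed key list are assumptions for illustration):

```python
import bisect

def range_query(records, xs, xmin, xmax, ymin, ymax):
    """records: list of (x, y, i, s, t) tuples sorted by x;
    xs: precomputed list of the x keys, in the same order.
    Narrows to the x-range in O(log n), then tests y only there."""
    lo = bisect.bisect_right(xs, xmin)  # first index with x > xmin
    hi = bisect.bisect_left(xs, xmax)   # first index with x >= xmax
    return [r for r in records[lo:hi] if ymin < r[1] < ymax]
```

The same idea applies to seeking within the file itself (bisect on byte offsets, then scan to the next newline), which would avoid both the memory cost and the full linear scan.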