Your file contains lines, so seek() by itself is almost useless because it moves the file position in bytes, not in lines. That means you need to read the file very carefully if you want correct results; otherwise you'll end up without the - sign, with a missing decimal digit, or with the text cut somewhere in the middle of a number.
Not to mention quirks such as switching between scientific notation (eN) and plain floats, which might also happen if you dump the wrong stuff to the file.
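To make the byte-offset problem concrete, here is a minimal sketch. The sample numbers and the offsets are made up for illustration; the point is only that seek() does not know where a line or a number starts.

```python
import os
import tempfile

# Hypothetical sample data: one number per line, with different widths.
with tempfile.NamedTemporaryFile("wb", suffix=".txt", delete=False) as tmp:
    tmp.write(b"-1.5\n22.25\n3e-07\n")
    path = tmp.name

with open(path, "rb") as f:
    f.seek(1)                   # one byte in: the "-" sign is skipped
    no_sign = f.readline()      # b"1.5\n"
    f.seek(7)                   # lands in the middle of "22.25"
    cut_number = f.readline()   # b".25\n"

os.remove(path)

# Both values still parse as floats, so the error is silent.
print(float(no_sign), float(cut_number))
```

Both reads return something that float() accepts, which is exactly why this kind of mistake is hard to notice.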
Now about the reading: Python lets you use readlines(hint=-1). As the docs put it, hint can be specified to control the number of lines read: no more lines will be read if the total size (in bytes/characters) of all lines so far exceeds hint.
Therefore:
test.txt
123
456
789
012
345
678
console
>>> with open("test.txt") as f:
...     print(f.readlines(5))
...     print(f.readlines(9))
...
['123\n', '456\n']
['789\n', '012\n', '345\n']
I haven't measured it, but that's probably the fastest you can get in pure Python if you don't want to handle the lines yourself and don't want to shoot yourself in the foot with seek(), which might turn out even slower in the end because of suboptimal parsing on your side.
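For a large file, readlines(hint) would typically be called in a loop until it returns an empty list. A minimal sketch, where iter_batches and the default hint value are my own assumptions, not something from the question:

```python
import os
import tempfile

def iter_batches(path, hint=1 << 16):
    """Yield lists of lines, each list holding roughly `hint` characters."""
    with open(path) as f:
        while True:
            batch = f.readlines(hint)
            if not batch:       # an empty list means end of file
                break
            yield batch

# Same sample content as test.txt above, written to a temporary file.
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as tmp:
    tmp.write("123\n456\n789\n012\n345\n678\n")
    path = tmp.name

batches = list(iter_batches(path, hint=5))
total = sum(int(line) for batch in batches for line in batch)
os.remove(path)
```

Only one batch is held in memory at a time, so the hint is effectively a knob between memory use and the number of read calls.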
I'm a little confused by "... from a specific location to a specific location?". If no parsing is intended, the solution might as well be a bash script or something similar, but you have to know the number of lines in the file (an alternative to the readlines(hint=-1) function):
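with open(file) as inp:
    # "w" is required: open() defaults to read-only mode
    with open(file2, "w") as out:
        for idx in range(num_of_lines):
            # readline() takes a size limit, not a line index
            line = inp.readline()
            if not some_logic(line):
                continue
            out.write(line)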
Note: the with blocks are nested only to avoid the overhead of reading the whole file first and then checking and writing somewhere else.
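If "from a specific location to a specific location" means line numbers, itertools.islice can pick the range without knowing the total number of lines. This is only a sketch under that assumption; read_line_range is a hypothetical helper name and the indices are 0-based with an exclusive stop.

```python
import itertools
import os
import tempfile

def read_line_range(path, start, stop):
    """Return the lines with 0-based index in [start, stop)."""
    with open(path) as f:
        # islice still iterates over the skipped lines, it just discards them
        return list(itertools.islice(f, start, stop))

# Same sample content as test.txt above, written to a temporary file.
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as tmp:
    tmp.write("123\n456\n789\n012\n345\n678\n")
    path = tmp.name

selected = read_line_range(path, 2, 4)
os.remove(path)
```

The file is still read from the beginning, so this avoids the seek() pitfalls rather than the I/O cost of the skipped lines.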
Nevertheless, you use numpy, which is just a small step from Cython or C/C++ libraries. That means you can skip the Python overhead and read the file with Cython or C directly. See: mmap, mmap vs ifstream vs fread.
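Before dropping down to C, mmap can also be tried from Python itself through the standard mmap module. A minimal sketch, not measured; mmap_lines is a hypothetical helper name, and note that mapping an empty file raises ValueError.

```python
import mmap
import os
import tempfile

def mmap_lines(path):
    """Read all lines of a non-empty file through a read-only memory map."""
    with open(path, "rb") as f:
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
            # mm.readline() returns b"" at end of file, which stops iter()
            return list(iter(mm.readline, b""))

# Hypothetical sample file for the demonstration.
with tempfile.NamedTemporaryFile("wb", suffix=".txt", delete=False) as tmp:
    tmp.write(b"123\n456\n789\n")
    path = tmp.name

lines = mmap_lines(path)
os.remove(path)
```

The lines come back as bytes, so they still need decoding or parsing afterwards.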
Here is an article actually doing measurements of:
- Python code (readline()),
- Cython (just dummy compilation),
- C (cimport from stdio.h to use getline() (can't find C reference :/ )),
- C++ (seems to be wrongly marked as C in the plot).
This seems to be the most efficient code from it, with some cleanup and with the lines collected into a list. It should give you an idea in case you want to experiment with mmap or other fancy reading. I don't have measurements for that though:
dependencies
apt install build-essential # gcc, etc
pip install cython
setup.py
from setuptools import setup  # distutils was removed in Python 3.12
from Cython.Build import cythonize

setup(
    name="test",
    ext_modules=cythonize("test.pyx"),
)
test.pyx
from libc.stdio cimport *
from libc.stdlib cimport free

cdef extern from "stdio.h":
    FILE *fopen(const char *, const char *)
    int fclose(FILE *)
    ssize_t getline(char **, size_t *, FILE *)

def read_file(filename):
    filename_byte_string = filename.encode("UTF-8")
    cdef char* fname = filename_byte_string
    cdef FILE* cfile
    cfile = fopen(fname, "rb")
    if cfile == NULL:
        raise FileNotFoundError(2, "No such file or directory: '%s'" % filename)

    cdef char * line = NULL
    cdef size_t l = 0
    cdef ssize_t read
    cdef list result = []
    while True:
        # getline() (re)allocates the buffer behind `line` as needed
        read = getline(&line, &l, cfile)
        if read == -1:
            break
        # appending a char* copies it into a new Python bytes object
        result.append(line)
    free(line)  # release the buffer allocated by getline()
    fclose(cfile)
    return result
shell
pip install --editable .
console
from test import read_file
lines = read_file(file)