
Currently, I'm doing:

    source_noise = np.fromfile('data/noise/' + source + '_16k.dat', sep='\n')
    source_noise_start = np.random.randint(
        0, len(source_noise) - len(audio_array))
    source_noise = source_noise[source_noise_start:
                                source_noise_start + len(audio_array)]

My file looks like:

  -5.3302745e+02
  -5.3985005e+02
  -5.8963920e+02
  -6.5875741e+02
  -5.7371864e+02
  -2.0796765e+02
   2.8152341e+02
   6.5398089e+02
   8.6053581e+02

.. and on and on.

This requires that I read the entire file, when all I want to do is read a part of a file. Is there any way for me to do this with Python that will be FASTER than what I'm doing now?

Shamoon
  • You could use the regular open function and then use file.readline() in a loop to only go over the first n lines. This does however require you to parse the data yourself as it will just return it as text. – Eric Dec 22 '19 at 14:55
  • I’m happy to parse it myself as it’s just numbers on each line. But I don’t want to necessarily do the first lines. – Shamoon Dec 22 '19 at 15:23
  • Is the performance actually an issue here? – AMC Dec 22 '19 at 18:10
  • Yes. I do this thousands of times across thousands of files, so I need it to be as quick as possible. – Shamoon Dec 23 '19 at 11:57
  • Please give some indication of the number of files, the number of samples in each and the number of samples you wish to read from each. – Mark Setchell Dec 26 '19 at 18:05
  • Have you abandoned this question? – Mark Setchell Dec 30 '19 at 12:55

2 Answers


You can use the `seek` method to move inside the file and read from specific places.

# file contents: "hello world"
start_read = 6

with open("filename", 'rb') as file:
    file.seek(start_read)    # jump to byte offset 6
    output = file.read(5)    # read the next 5 bytes
    print(output)            # prints b'world' (bytes, because of 'rb' mode)
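In the asker's case this is more useful than it first looks: the sample data appears to be fixed-width, and when every line has exactly the same byte length, `seek()` can jump straight to a given line without scanning anything before it. A sketch — the line width is an assumption you must verify against your actual file, and `read_slice` is a hypothetical helper name:

```python
import numpy as np

def read_slice(path, start_line, n_lines, line_width):
    """Jump straight to `start_line` and parse `n_lines` lines.

    Only valid when every line is exactly `line_width` bytes,
    newline included -- verify this for your file first.
    """
    with open(path, 'rb') as f:
        f.seek(start_line * line_width)      # no scanning of earlier lines
        chunk = f.read(n_lines * line_width)
    return np.array(chunk.decode().split(), dtype=float)
```

Because nothing before `start_line` is ever read, this is O(1) in the position of the slice, unlike any line-by-line approach.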

Your file is line-oriented, so `seek()` by itself is of limited use because it offsets into the file in bytes. That means you need to read the file very carefully if you want correct results; otherwise you'll end up losing a `-` sign, dropping a decimal digit, or cutting the text in the middle of a number.

Not to mention quirks such as switching between scientific notation (`eN`) and plain floats, which can happen if the wrong data gets dumped to the file.

Now, about the reading: Python allows you to use readlines(hint=-1):

hint can be specified to control the number of lines read: no more lines will be read if the total size (in bytes/characters) of all lines so far exceeds hint.

Therefore:

test.txt

123
456
789
012
345
678

console

>>> with open("test.txt") as f:
...     print(f.readlines(5))
...     print(f.readlines(9))
... 
['123\n', '456\n']
['789\n', '012\n', '345\n']

I haven't measured it, but this is probably the fastest you can get in pure Python if you don't want to handle the lines yourself, and it avoids shooting yourself in the foot with `seek()`, which might even end up slower due to a suboptimal parsing solution on your side.
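Along the same lines, `itertools.islice` gives you an arbitrary window of lines rather than just a prefix — it still has to scan past the earlier lines, but it never keeps more than the requested window in memory. A sketch (`read_lines` is a hypothetical helper name):

```python
from itertools import islice

def read_lines(path, start, count):
    """Parse `count` lines beginning at line `start` (0-based).

    Lines before `start` are still scanned, but nothing outside
    the requested window is parsed or stored.
    """
    with open(path) as f:
        return [float(line) for line in islice(f, start, start + count)]
```

With the `test.txt` above, `read_lines("test.txt", 2, 3)` would hand back the third through fifth values as floats.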

I'm a little bit confused by "... from a specific location to a specific location?". If parsing isn't needed, the solution might as well be a bash script or some similar tool, but you have to know the number of lines in the file (this is an alternative to the readlines(hint=-1) approach):

with open(file) as inp:
    with open(file2, 'w') as out:
        for _ in range(num_of_lines):
            # note: readline()'s argument is a size limit, not a line index
            line = inp.readline()
            if not some_logic(line):
                continue
            out.write(line)

Note: the nested with blocks are there only to skip the overhead of reading the whole file first and then checking and writing somewhere else.

Nevertheless, you already use numpy, which is just a small step away from Cython or C/C++ libraries. That means you can skip the Python overhead and read the file with Cython or C directly.
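Before dropping down to Cython or C, note that numpy itself can do the slicing during parsing: `np.loadtxt` accepts `skiprows` and `max_rows` (the latter since NumPy 1.16), so only the requested window is parsed into an array. The earlier lines are still scanned, but never parsed or stored. A sketch (`load_window` is a hypothetical helper name):

```python
import numpy as np

def load_window(path, start, count):
    """Parse only `count` lines beginning at line `start`.

    skiprows still scans past the earlier lines, but nothing
    before `start` is parsed or kept in memory.
    Requires NumPy >= 1.16 for max_rows.
    """
    return np.loadtxt(path, skiprows=start, max_rows=count)
```

This keeps the whole thing in numpy, which matters if the result feeds straight into array arithmetic as in the question.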

Also relevant: mmap, and mmap vs ifstream vs fread comparisons.
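Staying with that idea: if you can afford a one-time conversion of each text file to raw binary, `np.memmap` then gives constant-time random slices with no parsing at all. A sketch — the helper names and the float64 dtype are assumptions, and the conversion cost is paid once per file:

```python
import numpy as np

def convert_to_binary(text_path, bin_path):
    """One-time cost: parse the whole text file, dump raw float64 samples."""
    np.fromfile(text_path, sep='\n').tofile(bin_path)

def load_segment(bin_path, start, length):
    """O(1) random access: memory-map the binary file, copy one slice."""
    noise = np.memmap(bin_path, dtype=np.float64, mode='r')
    return np.array(noise[start:start + length])
```

Since the question mentions doing this thousands of times across thousands of files, amortizing a single conversion pass against many fast slices may be the biggest win available.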

Here is an article actually doing measurements of:

  • Python code (readline()),
  • Cython (just dummy compilation),
  • C (cimport from stdio.h to use getline(), a POSIX function),
  • C++ (seemingly mislabeled as C in the plot).

The code below seems to be the most efficient version from that article, with some cleanup and with the line collection pulled out. It should give you an idea in case you want to experiment with mmap or other fancy reading; I don't have measurements for those, though:

dependencies

apt install build-essential  # gcc, etc
pip install cython

setup.py

from distutils.core import setup
from Cython.Build import cythonize

setup(
    name="test",
    ext_modules = cythonize("test.pyx")
)

test.pyx

from libc.stdio cimport *
from libc.stdlib cimport free

cdef extern from "stdio.h":
    # getline() is POSIX, so it is not declared in libc.stdio
    FILE *fopen(const char *, const char *)
    int fclose(FILE *)
    ssize_t getline(char **, size_t *, FILE *)

def read_file(filename):
    filename_byte_string = filename.encode("UTF-8")
    cdef char* fname = filename_byte_string

    cdef FILE* cfile
    cfile = fopen(fname, "rb")
    if cfile == NULL:
        raise FileNotFoundError(2, "No such file or directory: '%s'" % filename)

    cdef char * line = NULL
    cdef size_t l = 0
    cdef ssize_t read
    cdef list result = []

    while True:
        read = getline(&line, &l, cfile)
        if read == -1:
            break
        result.append(line)

    free(line)   # getline() allocates the buffer; release it
    fclose(cfile)
    return result

shell

pip install --editable .

console

from test import read_file
lines = read_file(file)
Peter Badida