
The course that I am teaching has a requirement to cover the concept of random files: the course content specifies that the files are of a fixed size/length, each location contains a record, and the location in which to store/from which to read is determined through a hashing function (with collisions dealt with in a number of ways). While I am happy with the theory and the pseudocode used to explain this concept, I have to admit I am struggling to turn it into suitable Python code.

What I need is to be able to

  • use a key value and a hashing function to determine a line number in a file (I can do this part)
  • jump to that line/location in a specified file
  • amend or read the data on that line/at that location

I have seen a couple of mentions of pickling and mmap'ing while doing a bit of research, but I'm not sure if either would be the best approach. Very grateful for any guidance.

Robert Flook
  • `with open('filename.txt', 'r') as f: content = f.readlines()` content is now a list line by line. is that what you're looking for? – Paritosh Singh Dec 07 '18 at 16:10
  • Reading a line at an arbitrary location is easy. Updating it is impossible without reading the entire file and rewriting it if the new data is larger than the old. – Mark Ransom Dec 07 '18 at 16:10
  • @ParitoshSingh: Not much randomness in your example. – Robert Harvey Dec 07 '18 at 16:11
  • Check out [mmap](https://docs.python.org/2/library/mmap.html), which will map a file into your main memory where you can access it in O(1). – Thomas Lang Dec 07 '18 at 16:12
  • You may want to fine-tune your terminology. "Random file": a file that contains random data/any single file on your system. "Random **Access** file": a file where you can access every item you want. – Jongware Dec 07 '18 at 16:15
  • Do you mean random-*access* file? Think about the interface it provides; whether or not the data resides in memory or on disk is secondary, and hardly relevant here. – chepner Dec 07 '18 at 16:15
  • See this link and consider using linecache. This will surely fit your needs. https://stackoverflow.com/questions/4999340/python-random-access-file – Nabil Dec 07 '18 at 16:19
  • @usr2564301, yes absolutely correct on your part. I was blind copying the syllabus term. – Robert Flook Dec 08 '18 at 17:27
  • @ParitoshSingh thanks but no, I know how to do that, I want to be able to open the file, not read everything, jump directly to a specific location and then read the data/line at that location or amend that data/line. – Robert Flook Dec 08 '18 at 17:27
  • @chepner - not quite sure what you mean by 'hardly relevant here'? – Robert Flook Dec 08 '18 at 17:27
  • @MarkRansom is there absolutely no way to jump to a specified line/point in the file/data and then amend the data without having to read all the data from the file, amend the relevant data and then write all the data back again? – Robert Flook Dec 08 '18 at 17:28
  • @RobertFlook If the records are of a *fixed* size, you can indeed jump to a particular record (the offset would be the record size times the record number) and overwrite it with a different record. For an ordinary text file, the size of each line might vary, and the only way to replace one line with a line of different size is to rewrite both the replaced line *and* every line following it (to make room for the extra bytes or to fill the space left by the missing bytes). – chepner Dec 08 '18 at 17:41
  • @RobertFlook By hardly relevant, I mean that the *interface* to such a record file and a list is basically the same; both are just sequences of bytes. – chepner Dec 08 '18 at 17:41

1 Answer


The problem can be divided into two parts:

  1. Deciding on a (binary?) fixed-length format for your records, and being able to serialize/deserialize your data to and from it; the result must be a fixed-length bytestring.
  2. Seeking to, and then reading/writing, such records in the file.

For point 1, there are many possibilities. You can use the struct module to generate/read binary records whose length is fixed, given a suitable format string.
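For illustration, a minimal sketch using struct; the record layout (a 4-byte unsigned key, a 20-byte name, an 8-byte float) is invented purely for the example:

```python
import struct

# Hypothetical record layout: 4-byte unsigned int key, 20-byte name field,
# 8-byte double -- 32 bytes per record, always.
RECORD_FORMAT = "<I20sd"
RECORD_SIZE = struct.calcsize(RECORD_FORMAT)

def pack_record(key, name, balance):
    # Encode the name and pad/truncate it to exactly 20 bytes before packing.
    name_bytes = name.encode("ascii")[:20].ljust(20, b" ")
    return struct.pack(RECORD_FORMAT, key, name_bytes, balance)

def unpack_record(data):
    key, name_bytes, balance = struct.unpack(RECORD_FORMAT, data)
    return key, name_bytes.rstrip(b" ").decode("ascii"), balance
```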

A lower-technology, but still valid, possibility is simply to work with fixed-size text records, each field padded to a known width with whitespace or some other filler. These can easily be generated by padding each field (for example with ljust) and split on reading using plain slicing.

Just be careful that, for this to work correctly, your fields must be formatted/padded as byte strings, not Unicode ones: if you composed the record as a Unicode string and then converted it to UTF-8, its length might change, since UTF-8 is a variable-length encoding.
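A minimal sketch of that approach; the field widths are arbitrary and chosen just for the example:

```python
FIELD_WIDTHS = (10, 30, 10)            # arbitrary widths, for illustration only
RECORD_SIZE = sum(FIELD_WIDTHS) + 1    # +1 for a trailing newline (cosmetic)

def make_text_record(*fields):
    # Pad each field to its fixed width *after* encoding, so the byte length
    # is guaranteed (encoding afterwards could change the length).
    parts = []
    for field, width in zip(fields, FIELD_WIDTHS):
        raw = str(field).encode("utf-8")
        if len(raw) > width:
            raise ValueError("field too long for its slot")
        parts.append(raw.ljust(width, b" "))
    return b"".join(parts) + b"\n"

def split_text_record(record):
    fields, pos = [], 0
    for width in FIELD_WIDTHS:
        fields.append(record[pos:pos + width].rstrip(b" ").decode("utf-8"))
        pos += width
    return fields
```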

As for the second part, it's the easiest: just open the file in binary mode (you don't want newline translation to mess with your bytes), use the seek method to move to the record you need (the position is the record number multiplied by the record size), and then use read (passing the record size) or write (passing an appropriately sized record).
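A minimal sketch of that access layer, assuming the RECORD_SIZE and pack/unpack helpers from the sketches above, plus a hash_key() function of your own that maps a key to a record number:

```python
def read_record(path, record_no):
    with open(path, "rb") as f:
        f.seek(record_no * RECORD_SIZE)      # jump straight to the record's offset
        return f.read(RECORD_SIZE)

def write_record(path, record_no, record):
    assert len(record) == RECORD_SIZE
    # "r+b" opens for update without truncating; the file must already exist,
    # e.g. pre-filled with empty records up to the fixed number of slots.
    with open(path, "r+b") as f:
        f.seek(record_no * RECORD_SIZE)
        f.write(record)
```

Usage would then look something like `write_record("accounts.dat", hash_key(key), pack_record(key, name, balance))`, and reading back is `unpack_record(read_record("accounts.dat", hash_key(key)))`; note that the file has to be created up front with as many empty records as your hash table has slots.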

Matteo Italia