
I'm working on a Python project that uses text files. It reads a few very long, UTF-8 encoded text files (a few thousand lines each, but possibly more than that) into a list, manipulates them quite a lot, and then writes them back to a file.

I was wondering if this is the correct way to do such a thing. That is, could this amount of text reach the limit of what Python can hold in memory? Should I take reaching the limit into account (and if so, how do I take it into account)?

Cheshie
  • This is unlikely to be language specific. However, many programs have trouble handling text files once they grow beyond 2 GB, which may then require special consideration. – BlackVegetable Apr 30 '14 at 15:56
  • Depends on your RAM more than on Python itself. We have no idea how you're trying to manipulate the text, so there's not enough for us to go on. Maybe show the code you're trying. – mhlester Apr 30 '14 at 15:56
  • @mhlester - the text manipulation is merely adding a few characters and then turning them into vectors (of numbers). I didn't think it was relevant for the question... – Cheshie Apr 30 '14 at 16:09
  • Where it would matter is for instance if you needed to sort the text, you'd want to load it all in memory, whereas processing one line at a time wouldn't require that. – mhlester Apr 30 '14 at 16:15
  • I see. Well, it could be done one line at a time, but I suspect that loading a sentence into memory, manipulating it, and writing it back loads of times would be slow... wouldn't it? – Cheshie Apr 30 '14 at 17:57

1 Answer


You correctly recognize that holding the content of many files in memory has its own costs and limits.

Python excels at just the opposite: looping through many items (files, records, whatever) while holding in memory only what is really relevant.

There are concepts called iterators and generators; one example is xrange. Instead of creating all the numbers that range(large_number) would have to hold in memory, xrange(large_number) provides the numbers one by one, keeping in memory only what is needed to produce the next one.
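To illustrate the difference, here is a minimal sketch (assuming Python 2, where xrange exists; in Python 3, range itself is already lazy):

```python
def squares_list(n):
    # Eager: builds and holds the whole list in memory at once.
    return [i * i for i in range(n)]

def squares_gen(n):
    # Generator: yields one value at a time, remembering only
    # where it left off between calls.
    for i in xrange(n):
        yield i * i

# Summing the generator never materializes more than one square at a time.
print(sum(squares_gen(10 ** 6)))
```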

You can read files and process their content the same way. Of course, if you need information from a wider context, you have to get it from somewhere, but in general, many real use cases do not require having everything in memory and then summing it all up to get the proper result.
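For example, a minimal line-by-line sketch (transform() and the file names here are just placeholders, not code from the question):

```python
# -*- coding: utf-8 -*-
import io

def transform(line):
    # Placeholder for whatever per-line manipulation is needed,
    # e.g. adding a few characters.
    return line.rstrip(u"\n") + u" [processed]\n"

# io.open works the same in Python 2 and 3 and handles UTF-8 decoding.
with io.open("input.txt", encoding="utf-8") as src, \
     io.open("output.txt", "w", encoding="utf-8") as dst:
    for line in src:                # the file object yields one line at a time
        dst.write(transform(line))
```

The whole file never has to fit in memory; only the current line does.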

For further work, I would point you to the following terms:

  • generator
  • iterator
  • module itertools

All of this is in the Python documentation, and there are plenty of nice tutorials around the web.
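As a small taste of itertools working lazily (nothing here beyond the standard library):

```python
import itertools

# count() is an endless lazy counter; islice() takes the first few items
# without ever building a full list.
counter = itertools.count(start=1)
first_five = list(itertools.islice(counter, 5))
print(first_five)   # [1, 2, 3, 4, 5]
```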

Jan Vlcinsky
  • Thanks @Jan. Initially I wrote my code iterating over the lines in the file (and writing them back separately, as you say). The problem is that it's extremely slow, so I thought that reading from and writing to files as little as possible would help. Meaning that I need all the data in memory while running - iterating, I suppose, wouldn't help. Which is why I asked my question. Any suggestions? – Cheshie Apr 30 '14 at 18:05
  • Unless you explicitly read data byte by byte, iterating is usually not significantly slower than reading everything at once (the system cache behind the scenes takes care of this anyway). I would suspect some inefficiency inside the loop, which over many iterations adds up to a slow run. Consider profiling or posting a new question. – Jan Vlcinsky Apr 30 '14 at 18:08
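A minimal sketch of that profiling suggestion (process_file is a hypothetical stand-in for the actual processing function):

```python
import cProfile

# Prints a per-function breakdown, sorted by cumulative time spent,
# which usually points straight at the slow part of the loop.
cProfile.run("process_file('input.txt')", sort="cumulative")
```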