
Many coding challenges put multiple numbers on the same line, often with a first line telling how many numbers that multi-number line contains:

4
31 415 9 26

Usually I just read the entire line, then .split() it and map the strings to numbers.
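
For the example above, that looks something like this (reading from sys.stdin here):

import sys

n = int(sys.stdin.readline())
values = list(map(int, sys.stdin.readline().split()))
# values == [31, 415, 9, 26], and len(values) == n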

But is there a good way to avoid reading the entire line at once and instead read one number at a time? The point is to save memory, because I can't or don't want to read the whole line into memory. I'd like to use only O(1) space (let's assume the numbers are small/bounded, so each is O(1) size). It doesn't have to be absolutely minimal; for example, if a solution internally reads a full 4 KB memory page at a time, that's ok, still O(1) and relatively small. As a use case, think millions of numbers on one line and a memory limit of, say, below 1 MB.

In C++ I'd do it like this:

int n;
std::cin >> n;
while (n--) {
    int value;
    std::cin >> value;
    // now do something with the value
}

I wrote this generator that takes a file object and gives me an iterator of strings. For the above example, it yields the strings '4', '31', '415', '9' and '26'. It reads one character at a time and splits at whitespace characters as determined by .isspace():

def split(file):
    value = []  # characters of the current token
    while char := file.read(1):
        if char.isspace():
            if value:  # whitespace ends the current token (if any)
                yield ''.join(value)
            value.clear()
        else:
            value.append(char)
    if value:  # final token if the file doesn't end with whitespace
        yield ''.join(value)

But that's of course ridiculously complicated and slow, and I don't even know whether this str.isspace usage is equivalent to what str.split considers whitespace. It just illustrates one way to achieve what I want.
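
With that generator, the Python equivalent of the C++ loop above is something like this (again assuming input on sys.stdin):

import sys

tokens = split(sys.stdin)
n = int(next(tokens))
for _ in range(n):
    value = int(next(tokens))
    # now do something with the value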

Edit: Here's a simpler way, but still more complicated and slow than I'd like. I'm looking for some built-in way that does the low-level work for me, at C speed.

from itertools import groupby

def split(file):
    groups = groupby(iter(lambda: file.read(1), ''), str.isspace)
    for isspace, chars in groups:
        if not isspace:
            yield ''.join(chars)
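
I've also tried reading larger chunks (say 4 KB) and stitching together tokens that get cut off at chunk boundaries, roughly like this sketch (chunk size arbitrary), but that gets fiddly quickly and is still all Python-level work:

def split_chunks(file, chunk_size=4096):
    leftover = ''
    while chunk := file.read(chunk_size):
        parts = (leftover + chunk).split()
        if chunk[-1].isspace():
            leftover = ''
        else:
            # The chunk ended mid-token, keep the cut-off piece for the next round.
            leftover = parts.pop()
        yield from parts
    if leftover:
        yield leftover
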
Kelly Bundy
  • Does [python readline with custom delimiter](https://stackoverflow.com/questions/51980776/python-readline-with-custom-delimiter) answer your question? (Make spaces be treated as a line delimiter, and there you are). – Charles Duffy Nov 14 '20 at 16:38
  • Have you run into a case where the memory is not enough for just reading the line and using split? – Dani Mesejo Nov 14 '20 at 16:39
  • You could use `re.finditer(r"\w+", s)` which will provide an iterator. – Mark Nov 14 '20 at 16:41
  • @CharlesDuffy No. I had actually looked into that, but `' '` is not allowed as delimiter. I tried, gets `ValueError: illegal newline value: `. – Kelly Bundy Nov 14 '20 at 16:42
  • @MarkMeyer And what is `s`? – Kelly Bundy Nov 14 '20 at 16:42
  • https://stackoverflow.com/questions/510357/how-to-read-a-single-character-from-the-user may be? – Equinox Nov 14 '20 at 16:44
  • @DaniMesejo No, but I've run into a case where I didn't *want* to do that. For a coding challenge where I'd otherwise only need O(1) space. – Kelly Bundy Nov 14 '20 at 16:44
  • Hmm. Does read-only file-mapped memory count against your space usage, for purposes of this challenge? :) – Charles Duffy Nov 14 '20 at 16:45
  • @venky__ Seems similar to what I used in my generator, so has the same issues. – Kelly Bundy Nov 14 '20 at 16:45
  • @MarkMeyer 's suggestion to use `finditer` would be good. It still reads one by one but it would do so in compiled code and you're only reading 3 or so at a time anyway. – Anonymous1847 Nov 14 '20 at 16:47
  • In fact, why do you want to avoid reading one by one? At best you could be reading three by three, which isn't much better. – Anonymous1847 Nov 14 '20 at 16:48
  • @CharlesDuffy I guess you mean memory-mapped file? :-) I'm not exactly familiar with that, but if that takes only a relatively small amount of memory, that'd be ok, yes. I considered trying that in combination with `re.finditer`, but that's documented as wanting a string, so I gave that up. – Kelly Bundy Nov 14 '20 at 16:49
  • Heh. The main point of mmap'ing your input file is that it isn't really memory allocated to your process at all; instead, you're getting pages from the operating system's block cache. So it's really a question of how whatever tools are being used to measure your answer's memory allocation work; it's _conceivable_ that they wouldn't recognize a mmap operation as using any memory at all. Could be argued to be cheating, but isn't that part of what makes it fun? :) – Charles Duffy Nov 14 '20 at 16:50
  • Just curious, if the challenge is online can you share the link? Normally Python has roughly 5 to 10 times more allotted time and 2x(?) extra memory. – Equinox Nov 14 '20 at 16:50
  • @Anonymous1847 With "avoid reading one by one", I guess you mean one *character* at a time like my function does? I don't care if something like `finditer` does that internally. If it reads let's say 4 KB (one memory page) at a time, that'd be ok, still O(1) and relatively small. – Kelly Bundy Nov 14 '20 at 16:54
  • @venky__ [Here's my answer](https://codereview.stackexchange.com/a/252059/219610) where I could write a solution that takes only O(1) space *if* I had a good way to do what I'm asking for here. The link to the problem is in the question there. That's the case I mentioned to DaniMesejo above. Like I said it's not a case where I *can't* do it but where I don't *want* to. In that case there are actually only up to 100 numbers. But imagine cases with *millions* of numbers (and a memory limit let's say below 1 MB). – Kelly Bundy Nov 14 '20 at 17:02
  • See if this helps: https://stackoverflow.com/questions/20019503/how-to-read-tokens-without-reading-whole-line-or-file – Dani Mesejo Nov 14 '20 at 17:33
  • By the way instead of using read(1) you could increase it to read(10) and this certainly will speed up your current approach – Dani Mesejo Nov 14 '20 at 17:35
  • @DaniMesejo That mmap+re one looks good, if it indeed doesn't take much memory. So I guess the `finditer` documentation lied to me (or at least is misleading) when it said it wants a string. And yes, I already tried reading larger chunks and using `.split()` on them and re-joining appropriately, but that became a nightmare. That's what I meant when I said I used n=1 to make handling the data easier. – Kelly Bundy Nov 14 '20 at 17:46
  • @python_learner Why/how would they "not count this as memory"? I don't think that's true. – Kelly Bundy Nov 14 '20 at 17:53
  • @HeapOverflow I realized the question was linked in code review, and reading inputs is not considered as part of your algorithm's memory, if your code fails its most likely your algorithm needs working, I am of course talking from personal experience – python_user Nov 14 '20 at 17:56
  • @python_learner I guess you mean like Topcoder or LeetCode, where you really don't read the data but would be given a list of ints already? Yeah, in those cases I wouldn't count the memory, either. But what I'm asking about here and what the question of that code review is about, that includes reading the input, so there it does count. – Kelly Bundy Nov 14 '20 at 18:01
  • @DaniMesejo Hrmpf, got that mmap+re one to work, but only with `bytes`, not with `str`. So that's not optimal, though for the stated usage (coding challenges with ASCII numbers and whitespace) it suffices. Will have to experiment for speed and memory usage... – Kelly Bundy Nov 14 '20 at 18:41
  • So would a solution using read(n) with n > 1 will help you? – Dani Mesejo Nov 14 '20 at 18:44
  • @DaniMesejo Perhaps. Depends on how complicated/fast it is :-) – Kelly Bundy Nov 14 '20 at 19:00
  • @DaniMesejo I can of course read larger chunks and then still process them one character at a time. That makes it faster, but also even more complicated. I've done some benchmarks now, irritatingly my original generator seems to be only about ten times slower than `file.read().split()` (not a valid solution for the question, but I'm taking it as baseline for how fast a solution could possibly be). – Kelly Bundy Nov 14 '20 at 19:15
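
For reference, here's a minimal sketch of the mmap + re.finditer approach discussed in the comments above. It yields bytes tokens, not str, and assumes the input is a regular file (something with a fileno(), so not a pipe like sys.stdin) containing ASCII whitespace-separated numbers:

import mmap
import re

def split_mmap(file):
    # Map the file read-only; the pages come from the OS block cache rather
    # than being read into the process's own buffers.
    with mmap.mmap(file.fileno(), 0, access=mmap.ACCESS_READ) as mm:
        # A bytes pattern can search any bytes-like object, including the mmap,
        # so the scanning happens in compiled code.
        for match in re.finditer(rb'\S+', mm):
            yield match.group()  # e.g. b'415'; int() accepts bytes directly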

0 Answers