
I am trying to add a large number of items (chunks) to a list. A list can only hold up to 536,870,912 items, but my dataset has 1 billion lines of code, and I am getting a MemoryError because I am trying to append 1 billion items to a list.

list = ["1","2","3","4",... 1 billion]
window_size = 4
db_chunk_hash = []
for i in range(len(list) - window_size + 1):
     c = list[i: i + window_size]
     i = bytes(str(c), encoding = 'utf8')     
     db_chunk_hash.append(i)

How can I solve this problem?

– ya xi er
  • Are you sure about this? Look at this answer: https://stackoverflow.com/questions/855191/how-big-can-a-python-list-get - most systems are 64-bit now, and on my own 64-bit Linux this number is humongous. – atru Dec 31 '22 at 15:31
  • I just evaluated `x = [1]*(10**9)` on a 64-bit Windows machine. It took a few seconds but worked as expected. That being said, you are going to run into the limits to what your machine can handle well before you run into the limits of what a 64-bit architecture can handle. If you are getting memory errors then you might already be at the limits of your machine. If so, a generator-based approach could be preferable (as the answer below suggests). – John Coleman Dec 31 '22 at 22:40
  • It can't help that you're copying items 0:3 then 1:4 then 2:5, and so on. – MatBailie Jan 01 '23 at 00:40
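Following up on MatBailie's point: if the 1-billion-item list itself is what exhausts memory, the windows can be produced from a streamed input (for example, a file read line by line) without ever materializing the whole list. The following is a minimal sketch, not taken from the question or the answer below, using collections.deque as a fixed-size sliding window; the file name lines.txt is hypothetical.

from collections import deque
from itertools import islice

def sliding_windows(iterable, window_size):
    """Yield each window of window_size consecutive items, keeping
    only window_size items in memory at a time."""
    it = iter(iterable)
    window = deque(islice(it, window_size), maxlen=window_size)
    if len(window) == window_size:
        yield bytes(str(list(window)), encoding='utf8')
    for item in it:
        window.append(item)  # the oldest item is dropped automatically
        yield bytes(str(list(window)), encoding='utf8')

# Hypothetical usage: stream lines from a file instead of building a
# 1-billion-item list first, and process each chunk as it is produced.
# with open('lines.txt') as f:
#     for chunk in sliding_windows((line.rstrip('\n') for line in f), 4):
#         ...  # hash/process chunk here; don't append it to a list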

1 Answer


You can use a Python generator function; that way the data is produced only as it is needed, which is much easier on memory.

def chunks(lst, window_size):
    for i in range(len(lst) - window_size + 1):
        c = lst[i: i + window_size]
        yield bytes(str(c), encoding='utf8')

window_size = 4
db_chunk_hash = chunks(lines, window_size)  # a generator, not a list
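
One caveat, added here as an illustration rather than as part of the original answer: db_chunk_hash is now a generator, so it can only be iterated once and should be consumed lazily. Calling list(db_chunk_hash) would rebuild the billion-item list and hit the same memory error. A minimal sketch of lazy consumption follows; hashlib.sha256 is just an example of per-chunk processing, not something the answer specifies.

import hashlib

# Iterate the generator once and handle each chunk as it arrives,
# instead of storing all results in a list.
for chunk in db_chunk_hash:
    digest = hashlib.sha256(chunk).hexdigest()
    # ... write digest to a file or database here rather than keeping it in memory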
– tomerar