
I read line 1 (not line 0) of 2 files as a string (the 1st ~30 MB and the 2nd ~50 MB); line 0 just has some information which I don't need at the moment. Line 1 is a string representation of an array which has around 1.3E6 smaller arrays like ['I1000009', 'A', '4024', 'A'] in it:

[[['I1000009', 'A', '4024', 'A'], ['I1000009', 'A', '6734', 'G'],...],[['H1000004', 'B', '4024', 'A'], ['L1000009', 'B', '6734', 'C'],...],[and so on],...]

Both files are filled in the same way; that's the reason why they are between 30 and 50 MB big. I read those files with my .py script to get access to the single pieces of information that I need:

import sys

myID        = sys.argv[1]
otherID     = sys.argv[2]

samePath        = '/home/srv/Dokumente/srv/' 
FolderName      = 'checkArrays/'
finishedFolder  = samePath+'finishedAnalysis/'
myNewFile       = samePath+FolderName+myID[0]+'/'+myID+'.txt'
otherFile       = samePath+FolderName+otherID[0]+'/'+otherID+'.txt'
nameFileOKarray = '_array_goodData.txt'

import csv 
import os 
import re #for regular expressions
# Text 2 - Start
import operator # for sorting the csv files
# Text 2 - End

whereIsMyArray    = 1
# read everything from line 1 onwards and eval the first remaining line back into a nested list
text_file         = open(finishedFolder+myID+nameFileOKarray, "r")
line              = text_file.readlines()[whereIsMyArray:]
myGoodFile        = eval(line[0])
text_file.close()

text_file         = open(finishedFolder+otherID+nameFileOKarray, "r")
line              = text_file.readlines()[whereIsMyArray:]
otherGoodFile     = eval(line[0])
text_file.close()

print(str(myGoodFile[0][0][0]))
print(str(otherGoodFile[0][0][0]))

The problem I have is that if I start my .py script from the shell:

python3 checkarr_v1.py 44 39

the RAM of my 4 GB Pi server increases to the limit of RAM and swap and the process dies. Then I tried to start the .py script on a 32 GB RAM server and, look at that, it worked, but the RAM usage is really huge. See pics:

(slack mode) overview of normal RAM and CPU usage: slackmode

(start sequence) overview at the highest RAM usage, ~6 GB, and CPU: highest point

Then it goes up and down for about 1 min: 1.2 GB to 3.6 GB, then to 1.7 GB, then to 1 GB; after ~1 min the script finishes and the right output is shown.

Can you help me understand whether there is a better way to solve this on a 4 GB Raspberry Pi? Is there a better way to write the 2 files, because the [",] symbols also take up space in the file? Is there a better solution than the eval function to turn that string into an array? Sorry for those questions, but I can't understand why 80 MB of files push the RAM to around 6 GB. That sounds like I'm doing something wrong. Best regards and thanks.

user2
  • You can't have 1.3E9 items in a 50 MB text file. Even if each element is just 1 byte plus a comma, that would be 2.6GB. – Thomas Weller Nov 30 '20 at 10:49
  • @MauriceMeyer Big thanks, that is helpful. Maybe I have to rethink my concept of using Python. – user2 Nov 30 '20 at 11:50
  • @ThomasWeller I don't know what you are calculating, but I have a 50 MB file with 1.3E9 ['fist','A','B','C'] array entries in it. – user2 Nov 30 '20 at 11:52
  • 1E3 is 1000 or k, 1E6 is 1000*1000 or a million or M, 1E9 is 1000*1000*1000 or a billion or G. There's either a misunderstanding or you have a NTFS compressed file and you're looking at the compressed file size. – Thomas Weller Nov 30 '20 at 12:16
  • @ThomasWeller Sure... now I see my mistake. 1.3E6, not 9. Sorry :) – user2 Nov 30 '20 at 13:17

1 Answer

1.3E9 arrays are going to be lots and lots of bytes if you read them into your application, no matter what you do.

I don't know if your code does what you actually want to do, but you're only ever using the first data item. If that's all you want, then don't read the whole file; just read that first part.

But also: I would advise against using eval for deserializing data. The built-in json module will give you data in almost the same format (if you control the input format).
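
A minimal sketch of that idea, assuming the writer side can be changed so line 1 is dumped as JSON (double quotes instead of single quotes); big_nested_list and the file name here are just stand-ins for the real data and the _array_goodData.txt files:

import json

# stand-in for the real nested list that currently gets written with str()/repr()
big_nested_list = [[['I1000009', 'A', '4024', 'A'], ['I1000009', 'A', '6734', 'G']]]

# writer side: line 0 keeps the header info, line 1 becomes valid JSON
with open('44_array_goodData.txt', 'w') as f:
    f.write('header info that gets skipped anyway\n')
    json.dump(big_nested_list, f)

# reader side: skip line 0, parse line 1 with json.loads instead of eval
with open('44_array_goodData.txt', 'r') as f:
    f.readline()                          # throw away line 0
    myGoodFile = json.loads(f.readline())

print(myGoodFile[0][0][0])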

Still, in the end: If you want to hold that much data in your program, you're looking at many GB of memory usage.

If you just want to process it, I'd take a more iterative approach and do a little at a time rather than swallow the whole files, especially with limited resources.
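
As a rough sketch of that, assuming the files could be rewritten so that every "section" (one of the inner groups of entries) sits on its own line as JSON; process_section and the file name here are placeholders, not something from the question:

import json

def process_section(section):
    # placeholder for whatever comparison / database work happens per section
    print(section[0])

with open('44_array_goodData.jsonl', 'r') as f:
    f.readline()                       # still skip the header line
    for line in f:                     # one section per line, e.g. [["I1000009", "A", "4024", "A"], ...]
        section = json.loads(line)     # only this one section is in memory at a time
        process_section(section)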

Update: I see now that it's 1.3E6, not 1.3E9 entries. Big difference. :-) Then JSON data should be okay. On my machine, a list of 1.3M entries like ['RlFKUCUz', 'A', '4024', 'A'] takes about 250 MB.
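
One rough way to reproduce a measurement like that on a Linux machine (the exact number depends on the Python version and string contents; note that ru_maxrss is reported in kilobytes on Linux but in bytes on macOS):

import json
import resource

# round-trip 1.3 million small entries through JSON, roughly like parsing the real file
data = json.loads(json.dumps([['RlFKUCUz', 'A', '4024', 'A']] * 1_300_000))

peak_kb = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
print(f'{len(data)} entries, peak RSS about {peak_kb / 1024:.0f} MB')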

Mattias Nilsson
  • 1.) The code does what it has to do; it is not really spectacular. 2) What do you mean by "using the first data item"? If you mean to read just the first 200 parts of that string and convert only that part to the array, then forget it! I need the whole array for comparing it with a MySQL database. 3) Thanks for the json link, I think that's the reason; I thought that something was wrong with that eval function. Big thanks! 4) Yes, I think I really need many GB of RAM for doing this for more than 100 users at the same time. 5) I didn't understand the iterative approach. Some example? – user2 Nov 30 '20 at 12:02
  • @user2 2.) The `myGoodFile[0][0][0]` seems to only be accessing the very first data item. 5.) Iterative approach means reading a little at a time rather than loading everything into memory at once. For example, if you had a different layout of the file where each "section" was one line, you could easily process one line at a time and then throw it away. That way you would not need to hold all the data in memory at once. – Mattias Nilsson Nov 30 '20 at 14:00
  • You mean something like each line has its own array, like [['RlFKUCUz', 'A', '4024', 'A'],['2', 'A', '4111', 'B'],['bla', 'X', '4024', 'C'], ....], where just the arrays for A are on one line and B on the other and so on? Then I could have between 50k and 70k for just one single line and not all of it. It's an opportunity, but I think I will first try that JSON trick. Maybe it works, but first I have to figure out how to do that and rewrite much of my old code. Thank you and @MauriceMeyer. I got a 2nd way – user2 Nov 30 '20 at 22:21