
Having searched around for a bit (e.g. here, here, and here), I'm at a loss. How do I get Python 3.7 to use more than 2 GB memory?

Info about my setup: I'm running 64-bit PyCharm (2019.2.6) with 64-bit Python 3.7.5, and I've set -Xms=8g and -Xmx=16g in pycharm.vmoptions (as this suggests setting Xms to half of Xmx). This is running on macOS Catalina 10.15.3, on a machine with 40 GB of RAM (2*4 + 2*32).

What I'm trying to do, and why I want to increase memory use: I'm reading relatively large timeseries (200-400 columns, around 70 000 rows) into Pandas (v. 0.25.3) DataFrames from .txt files (file sizes range from 0.5 GB to 1.5 GB), and working with 10-15 of these files at a time. As I read in the files, I see the python3.7 process's memory climb to around 2 GB (sometimes 2.05 GB), before it drops back to a few hundred MB and climbs towards 2 GB again (and repeat).
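Roughly, the loading looks like this (a minimal sketch only: the folder name, separator, and all-numeric dtype below are illustrative assumptions, not necessarily my exact file format):

```python
import glob
import pandas as pd

# Illustrative only: folder, separator and dtype are assumptions,
# adjust them to the actual layout of the .txt files.
frames = []
for path in sorted(glob.glob("data/*.txt")):
    frames.append(
        pd.read_csv(
            path,
            sep=r"\s+",       # assuming whitespace-delimited columns
            dtype="float32",  # assuming all-numeric columns; halves memory vs. float64
        )
    )

data = pd.concat(frames, ignore_index=True)
data.info(memory_usage="deep")  # prints shape, dtypes and total memory use
```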

When I'm working with these timeseries [slicing, plotting, etc.], everything takes a relatively long time (a few minutes). I'm hoping this can be improved by increasing memory usage. However, if I'm wrong in my assumption that increased RAM usage in the Python process would improve performance, please let me know.

Prebsus
  • -Xms and -Xmx are JVM options. Python doesn't have memory restrictions AFAIK – geckos Mar 28 '20 at 15:44
  • Ok @geckos, but my understanding was that PyCharm runs on a JVM, and hence setting -Xms and -Xmx would influence the memory allocation to PyCharm (see: https://www.jetbrains.com/help/pycharm/tuning-the-ide.html). Why would it appear that Python will only use 2 GB of RAM? – Prebsus Mar 28 '20 at 15:48
  • Your memory usage (2 GB of RAM) will be related to the file you are reading and storing in your variable (memory) – jammin0921 Mar 28 '20 at 15:51
  • Sorry, I realized I left a key part out of the Q - I'm reading 13 of these files, so when I combine my data I would like to keep all ~10 GB in memory – Prebsus Mar 28 '20 at 15:52
  • The memory used by PyCharm (the IDE) is different from the memory used by Python (the interpreter). And yes, 10GB is quite a lot, think about reading only parts of the data in memory, or writing partial results to files, ask yourself if you really, really need the whole thing in memory at the same time. – Óscar López Mar 28 '20 at 15:56
  • @ÓscarLópez: Fair enough - my desire to look at it 'all at once' comes from what the data is describing: timeseries of loads on a number of different mechanical units over a 14-month period. Understanding these load patterns may help me explain why some units have failed and others have not, and hence determine what kind of limitations we are working with. But I guess I'll just have to go a bit more step-by-step on this instead of 'all-at-once' – Prebsus Mar 28 '20 at 16:00
  • One option that comes to mind when working with large amounts of data is to use a Jupyter notebook. Jupyter notebooks run in "cells", so you can read your data into a variable once in one cell and move to another cell to analyze the data. You won't have to re-read the data as long as you don't overwrite the variable used to store it. I do not know if PyCharm has that capability. – jammin0921 Mar 28 '20 at 16:02
  • Here is another option that might be beneficial to you if you wish to stay in PyCharm: https://stackoverflow.com/questions/23441657/pycharm-run-only-part-of-my-python-file – jammin0921 Mar 28 '20 at 16:04
  • @jammin0921: I'll try that! – Prebsus Mar 28 '20 at 16:05
  • What does `ts.memory_usage(True, True)` tell you for a time series in an average file? – Kelly Bundy Mar 28 '20 at 16:05
  • @HeapOverflow: 539144, i.e. when I print the output of `ts.memory_usage(True, True)` I get two columns: one which lists all the column names in the ts, and the other lists "539144" for every column name. – Prebsus Mar 28 '20 at 16:09
  • So then for 300 columns, that's only 162 MB? – Kelly Bundy Mar 28 '20 at 16:11
  • That's what I suspected, yes. I barely know any Pandas, though :-). But I can imagine it being more compact than in the text files (depends on how bloated those are). – Kelly Bundy Mar 28 '20 at 16:48
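(Following the memory_usage discussion above, this is roughly the check being run; the file path and read options are placeholders. At ~539,144 bytes per column and ~300 columns it comes out around 162 MB per file.)

```python
import pandas as pd

# Placeholder path/format; adjust sep etc. to the real files.
ts = pd.read_csv("one_file.txt", sep=r"\s+")

# index=True, deep=True also counts the index and any object (string) columns
per_column = ts.memory_usage(index=True, deep=True)  # Series of bytes per column
print(per_column.head())
print(f"total: {per_column.sum() / 1024**2:.0f} MB")
```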

1 Answer


Thanks to many helpful comments (geckos, jammin0921, Óscar López, and Heap Overflow), it looks like what I was observing was not a limitation of Python, but rather Python/Pandas managing the data efficiently: once the 12 GB of .txt files had been read into DataFrames, their total size was actually below 2 GB. This can be seen by looking at the memory usage of the DataFrame (df): `df.memory_usage(True, True).sum()`, which gave 1.9 GB.
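For reference, a sketch of that check (the DataFrame below is a synthetic stand-in; the real `df` is the concatenation of all the files):

```python
import numpy as np
import pandas as pd

# Stand-in for the combined DataFrame built from all the .txt files
df = pd.DataFrame(np.random.rand(70_000, 300))

total_bytes = df.memory_usage(index=True, deep=True).sum()
print(f"{total_bytes / 1024**3:.2f} GB in memory")  # ~1.9 GB for the real data
```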

Having tested this further by increasing the amount of data I read in, I do see RAM usage above 2 GB from the python3.7 process.

Prebsus