Q : "How can I ... Any help would be much appreciated!"
A :
Best to follow the laws of the ECONOMY-of-COMPUTING :
Your briefly sketched problem has, without question, immense "setup"-costs, with an unspecified amount of useful work to be computed on an unspecified HPC-ecosystem.
Even without hardware & rental details ( the devil is always hidden in the detail(s), and one can easily pay absurd amounts of money trying to make a (covertly) "shared"-platform deliver any improved computing performance - many startups have experienced this on voucher-sponsored promises, the more so if the overall computing strategy was poorly designed ),
I cannot resist quoting the so-called 1st Etore's Law of Evolution of Systems' Dynamics :
If we open a can of worms,
the only way to put them back
is to use a bigger can
Closing our eyes so as not to see the accumulating inefficiencies is the worst sin of sins, as devastatingly exponential growth of all costs, time & resources, incl. energy consumption, is commonly met on such complex systems, often stacked with inefficiencies on many levels.
ELEMENTARY RULES-of-THUMB ... how much we pay in [TIME]
Sorry if these were known to you beforehand; I am just trying to build some common ground, as a platform to lay further argumentation rock-solid on. More details are here, and this is only the needed beginning, as more problems will definitely come from any real-world O( Mx * Ny * ... )-scaling-related issues in further modelling.
               0.1 ns - CPU NOP - a DO-NOTHING instruction
               0.5 ns - CPU L1 dCACHE reference ( 1st introduced in the late '80s )
                 1 ns - speed-of-light (a photon) travels a 1 ft (30.5 cm) distance -- will stay, throughout any foreseeable future :o)
               3~4 ns - CPU L2 CACHE reference (2020/Q1)
                 7 ns - CPU L2 CACHE reference
                19 ns - CPU L3 CACHE reference (2020/Q1 considered slow on 28c Skylake)
______________________on_CPU______________________________________________________________________________________
                71 ns - CPU cross-QPI/NUMA best case on XEON E5-46*
               100 ns - own DDR MEMORY reference
               135 ns - CPU cross-QPI/NUMA best case on XEON E7-*
               325 ns - CPU cross-QPI/NUMA worst case on XEON E5-46*
             2,500 ns - Read 10 kB sequentially from MEMORY------ HPC-node
            25,000 ns - Read 100 kB sequentially from MEMORY------ HPC-node
           250,000 ns - Read 1 MB sequentially from MEMORY------ HPC-node
         2,500,000 ns - Read 10 MB sequentially from MEMORY------ HPC-node
        25,000,000 ns - Read 100 MB sequentially from MEMORY------ HPC-node (abstracted from shared physical RAM-I/O-channels)
       250,000,000 ns - Read 1 GB sequentially from MEMORY------ HPC-node (abstracted from shared physical RAM-I/O-channels)
     2,500,000,000 ns - Read 10 GB sequentially from MEMORY------ HPC-node (abstracted from shared physical RAM-I/O-channels)
    25,000,000,000 ns - Read 100 GB sequentially from MEMORY------ HPC-node (abstracted from shared physical RAM-I/O-channels)
_____________________________________________________________________________own_CPU/DDR__________________________
 |   |   |   |   |
 |   |   |   | ns|
 |   |   | us|
 |   | ms|
 |  s|
h|
           500,000 ns - Round trip within the same DataCenter ------- HPC-node / HPC-storage latency on each access
        20,000,000 ns - Send 2 MB over 1 Gbps NETWORK
       200,000,000 ns - Send 20 MB over 1 Gbps NETWORK
     2,000,000,000 ns - Send 200 MB over 1 Gbps NETWORK
    20,000,000,000 ns - Send 2 GB over 1 Gbps NETWORK
   200,000,000,000 ns - Send 20 GB over 1 Gbps NETWORK
 2,000,000,000,000 ns - Send 200 GB over 1 Gbps NETWORK
____________________________________________________________________________via_LAN_______________________________
       150,000,000 ns - Send a NETWORK packet CA -> Netherlands
____________________________________________________________________________via_WAN_______________________________
        10,000,000 ns - DISK seek, spent to start a file-I/O on spinning disks, paid again on any next piece of data to seek/read
        30,000,000 ns - DISK 1 MB sequential READ from a DISK
       300,000,000 ns - DISK 10 MB sequential READ from a DISK
     3,000,000,000 ns - DISK 100 MB sequential READ from a DISK
    30,000,000,000 ns - DISK 1 GB sequential READ from a DISK
   300,000,000,000 ns - DISK 10 GB sequential READ from a DISK
 3,000,000,000,000 ns - DISK 100 GB sequential READ from a DISK
______________________on_DISK_______________________________________________own_DISK______________________________
 |   |   |   |   |
 |   |   |   | ns|
 |   |   | us|
 |   | ms|
 |  s|
h|
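What do these rules of thumb mean for the ~ 25+GB of .CSV data in question? A back-of-envelope sketch, using only the rounded table values above ( rule-of-thumb figures, not measurements of your actual platform ):

# Back-of-envelope figures derived only from the rounded table values above
# ( rule-of-thumb numbers, not measurements of any concrete platform )
NS_PER_GB_RAM_READ   =     250_000_000   # sequential read from own DDR MEMORY
NS_PER_GB_DISK_READ  =  30_000_000_000   # sequential read from a spinning DISK
NS_PER_GB_1GBPS_XFER =  10_000_000_000   # 2 GB ~ 20,000,000,000 ns over 1 Gbps

DATA_GB = 25                             # the ~ 25+GB of .CSV data in question

to_s = lambda ns: ns / 1E9               # ns -> [s]

print( "RAM   read of 25 GB ~ %6.0f [s]" % to_s( DATA_GB * NS_PER_GB_RAM_READ   ) )   # ~     6 [s]
print( "DISK  read of 25 GB ~ %6.0f [s]" % to_s( DATA_GB * NS_PER_GB_DISK_READ  ) )   # ~   750 [s]
print( "1Gbps xfer of 25 GB ~ %6.0f [s]" % to_s( DATA_GB * NS_PER_GB_1GBPS_XFER ) )   # ~   250 [s]

Each repeated 25+GB file-I/O pass and each 30+GB transfer in the AS-WAS flow below pays some multiple of these baseline figures, before a single piece of useful work gets computed.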
Given these elements, the end-to-end computing strategy may and shall be improved.
AS-WAS STATE
... where the crash prevented any computing at all
A naive figure tells more than a thousand words :
localhost :
   +-----------+
   | .CSV 0    |
   |  .CSV 1   |      [ 1st time ] ~ 25+GB of file-I/O-s -- SLOWEST / EXPENSIVE
   |   .CSV 2  | -->  reading the many .CSV files into RAM
   |    ...    | -->  local ssh()-encrypt + encapsulate-process of the 25+GB of .CSV data
   |     .CSV 9|
   +-----------+
        |
        v
   LAN SLOW / WAN SLOWER transfer of 30+GB to the "HPC"
   ( ssh()-decryption & file-I/O storage-costs omitted for clarity )
        |
        v
"HPC" :
   [ 2nd time ] 30+GB file-I/O ~ 25+GB -- SLOWEST / EXPENSIVE ( storing the received .CSV files )
   [ 3rd time ] 25+GB file-I/O-s       -- SLOWEST / EXPENSIVE ( all .CSV files read into python / RAM )
        |         RAM : .CSV to df              -- CPU work
        |         df to LIST                    -- new RAM-allocation + list.append( df )-costs + 25+GB
        |
   many hours of the [SERIAL] one-after-another flow ...
        X  <---  crashed on about the 360-th file
        |
        v         RAM ~ 50+GB, with all the 25+GB dataframes held in a LIST
        |         LIST to a new 25+GB dataframe -- CPU + mem-I/O costs : RAM-allocation & DATA-processing
        |         ~ 50+GB flowing over only 2/3/? mem-I/O HW-channels
        |         ( and only if the "HPC" is *NOT* a "shared"-rental of cloud HW, remarketed as an "HPC"-illusion )
        |
        v
   <... some amount of some useful work -- "HPC"-processing the ~ 25+GB dataframe ...>
   <... the more of it, the better, as it dissolves the AWFULLY HIGH SETUP COSTS ...>
        |
        v
   [ 4th time ] 25+GB file-I/O-s -- SLOWEST / EXPENSIVE ( file left on remote storage (?) )
        |
       O?R
        |
   [ 5th time ] 25+GB file-I/O-s -- SLOWEST / EXPENSIVE
        |         RAM / CPU : ssh()-encrypt + encapsulate-process of the 25+GB of results for repatriation on localhost
        v
   LAN SLOW / WAN SLOWER transfer of 30+GB from the "HPC"
   ( ssh()-decryption & file-I/O storage-costs omitted for clarity )
        |
        v
localhost :
   [ 6th time ] 30+GB file-I/O ~ 25+GB -- SLOWEST / EXPENSIVE ( results stored back on localhost storage )
        |
        v
   SUCCESS ?  ... after how many failed attempts,
              having how high recurring costs
              for any next model-recompute step(s),
              all that at what overall [TIME]-domain & "HPC"-rental costs ?
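For the record, the middle part of the figure corresponds to the usual read-all-then-concat pandas pattern. A minimal sketch of that anti-pattern ( illustrative only; the directory and the glob-pattern are assumptions standing in for the question's unspecified details ):

import glob
import pandas as pd

# The AS-WAS anti-pattern sketched in the figure ( illustrative, not the OP's actual code ):
# each .CSV is read whole into RAM, appended to a LIST and only afterwards concatenated,
# so the peak RAM is roughly 2 x the data plus per-frame overheads,
# and a single bad file loses many hours of the pure-[SERIAL] flow.
frames = []
for path in sorted( glob.glob( "data/*.csv" ) ):      # hypothetical location of the many .CSV files
    frames.append( pd.read_csv( path ) )              # 25+GB worth of dataframes accumulated in the LIST
df_all = pd.concat( frames, ignore_index=True )       # plus another ~ 25+GB RAM-allocation on top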
Tips :
- review and reduce, where possible, the expensive data-item representations ( avoid using int64 where 8 bits are enough; packed bitmaps can help a lot ) -- see the dtype-downcasting sketch after this list
- precompute on localhost all items that can be precomputed ( avoiding repetitive steps )
- join such "reduced" CSV-files, using a trivial O/S command, into a single input file -- see the join-and-compress sketch after this list
- compress all data before any transport ( plain-text CSV typically compresses very well, often by an order of magnitude )
- prefer to formulate your computing with algorithms that can stream-process items along the data-flow, i.e. not waiting to load everything in-RAM before computing an average or similar trivially on-the-fly stream-computable values ( like .mean(), .sum(), .min(), .max() or even .rsi(), .EMA(), .TEMA(), .BollingerBands() ... and many more alike ) -- stream-computing formulations reduce the RAM-allocations, can be & shall be pre-computed ( once ) & minimise the [SERIAL], one-after-another processing-pipeline latency -- see the chunked-streaming sketch after this list
- if indeed in need of pandas-like processing while fighting the overall physical-RAM ceiling, try the smart numpy-tools instead, where the array syntax & methods remain the same, yet they can, by design, work without moving all data at once from disk into physical RAM ( this has been my life-saving trick ever since, the more so when running many-model simulations & HyperParameterSPACE optimisations on a few tens of GB of data on 32-bit hardware )
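A minimal sketch of the data-representation tip above ( column names and target dtypes are illustrative assumptions, not the question's schema ):

import pandas as pd

# Illustrative only : shrink the per-item representation before any transport / concatenation.
df = pd.read_csv( "one_of_the_files.csv" )                    # hypothetical input file

df["flag"]  = df["flag"].astype( "int8" )                     # 8 bits are enough -> 1/8 of an int64
df["price"] = pd.to_numeric( df["price"], downcast="float" )  # float64 -> float32, where the values permit
df["venue"] = df["venue"].astype( "category" )                # repeated strings -> small integer codes

print( df.memory_usage( deep=True ).sum() / 2**20, "[MB] after the down-casting" )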
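A portable join-and-compress sketch, for the case a trivial O/S one-liner is not at hand ( file & directory names are illustrative; it assumes each .CSV carries the same one-line header ):

import glob
import gzip
import shutil

# Illustrative only : join the "reduced" .CSV files into a single, gzip-compressed transport payload,
# keeping the header line of the first file and skipping the repeated headers of the rest.
files = sorted( glob.glob( "reduced/*.csv" ) )                # hypothetical location of the reduced files
with gzip.open( "joined_input.csv.gz", "wb" ) as out:
    for k, name in enumerate( files ):
        with open( name, "rb" ) as f:
            if k:
                f.readline()                                  # drop the repeated header line
            shutil.copyfileobj( f, out )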
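And a minimal chunked-streaming sketch : read the ( already joined, still gzip-compressed ) .CSV in chunks and keep only running aggregates, so never more than one chunk sits in RAM at a time ( the file name, chunk size and column name are illustrative ):

import pandas as pd

# Illustrative only : on-the-fly / streaming aggregation over a 25+GB .CSV,
# pandas decompresses the gzip-ed input transparently, chunk by chunk.
running_sum   = 0.0
running_count = 0
running_min   = float( "inf" )
running_max   = float( "-inf" )

for chunk in pd.read_csv( "joined_input.csv.gz", chunksize=1_000_000 ):   # hypothetical chunk size
    col            = chunk["value"]                                       # hypothetical column name
    running_sum   += col.sum()
    running_count += col.count()
    running_min    = min( running_min, col.min() )
    running_max    = max( running_max, col.max() )

print( "mean =", running_sum / running_count, "min =", running_min, "max =", running_max )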
For more details on going in the direction of RAM-protecting, memory-mapped np.ndarray processing, with all the smart numpy-vectorised and other high-performance-tuned tricks, start from this :
>>> print( np.memmap.__doc__ )
Create a memory-map to an array stored in a *binary* file on disk.
Memory-mapped files are used for accessing small segments of large files
on disk, without reading the entire file into memory. NumPy's
memmap's are array-like objects. (...)
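A minimal usage sketch, under the assumption the data was already converted ( once, e.g. on localhost ) from .CSV into a flat binary file of float32 values ( the file name and shape are illustrative and must match the actual file ):

import numpy as np

# Illustrative only : process a larger-than-RAM array through a memory-map,
# letting the O/S page-in just the touched segments, slice by slice.
mm = np.memmap( "converted_data.f32.bin",        # hypothetical, pre-converted binary file
                dtype = np.float32,
                mode  = "r",                     # read-only view of the on-disk data
                shape = ( 1_000_000_000, )       # illustrative; must match the actual file size
                )

BLOCK = 10_000_000                               # RAM-friendly slice size
total = 0.0
for i in range( 0, mm.shape[0], BLOCK ):
    total += float( mm[i:i+BLOCK].sum() )        # only this slice gets paged into physical RAM

print( "sum =", total )

Converting the .CSV ( once ) into such a binary form also removes the repeated text-parsing costs from every subsequent run.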