
I am trying to read several (>1000) .txt files (on average approx. 700 MB each; delimited, header-less CSV-style files whose delimiter is not a comma or other standard separator) and merge them into one pandas dataframe (to then run an analysis on the entire dataset).

I am running this through SSH on an HPC server, on which I requested 50 GB RAM, 1 node and 1 task per node (that was all just a wild guess, as I have never done this before).

So far, my idea was this:

import glob
import pandas as pd

# all_files: paths to the >1000 input .txt files (the glob pattern here is illustrative)
all_files = glob.glob('./*.txt')

li = []

for filename in all_files:
    df = pd.read_csv(filename, sep=None, header=0, engine='python')
    li.append(df)

df = pd.concat(li, axis=0, ignore_index=True)

but after a few hours, and having loaded approximately the 360th file, the process gets killed and I get the error message:

numpy.core._exceptions.MemoryError: Unable to allocate 1.11 GiB for an array with shape (10, 14921599) and data type float64

Do you have any idea how to load and merge the data in a more memory-efficient way? (I assume that just requesting more RAM still would not get me through the entire set of .txt files!?)

Also, I would like to save the resulting dataframe in a 'memory-efficient' way afterwards. Do you know the best way/format (CSV?) to do that?

Any help would be much appreciated!

  • The error is very suspicious: it means you are trying to load a CSV with 15 million columns and only 10 rows. Having so many columns in a CSV file is very unusual, and CSV files are clearly inefficient for such a case. How big are your input files: what is their shape and the average size of the files on the storage device? – Jérôme Richard Feb 13 '22 at 13:10
  • The input files are delimited .txt files. I am just using read_csv because I thought this is how I can directly load them into a pandas dataframe. They usually have 15 columns and lots of rows, and the average size is about 700.000 kb. Some columns are blank, so I guess they are dropped automatically? But yeah, no idea why it says (10, 14921599) instead of (14921599, 10)!? – Markus Eyting Feb 13 '22 at 14:01
  • Ok, so I think your input file is wrongly parsed. The error can be due to a badly formatted file (e.g. issues with CR LF characters or undelimited strings), wrong delimiters, or a bug in the parser. Can you check that the file is properly formatted, or post a link to a file that causes the issue? (The first lines might be sufficient to catch the issue and find possibly better parsing functions.) – Jérôme Richard Feb 13 '22 at 15:34
  • Excuse, Markus, my wrong interpretation of the decimal point when transcribing the "700.000 kb" piece of information. Now, seeing 700 MB per file and more than 1,000 such files in less than 50 GB of free-RAM space - side by side with the same amount of data replicated in the growing list, from which it ought finally to get into the super-DataFrame instance - I can't see how ~ 1000 x 700 [MB] ~~ 700 [GB] could ever "fit" into 50 GB (the more so as each un-managed (default) dtype expands each .TXT number item into 8 [B] of 64-bit storage; even a bool { 0 | 1 } will take a full float64, if not managed). – user3666197 Feb 14 '22 at 13:22
  • Plus, the sep=None directive rules out the fast C decoder and makes the Python parser guess the delimiter - *"...Delimiter to use. If sep is None, the C engine cannot automatically detect the separator, but the Python parsing engine can, meaning the latter will be used and automatically detect the separator by Python’s builtin sniffer tool, csv.Sniffer. In addition, separators longer than 1 character and different from '\s+' will be interpreted as regular expressions and will also force the use of the Python parsing engine. Note that regex delimiters are prone to ignoring quoted data. Regex example: '\r\t'."* – user3666197 Feb 14 '22 at 13:26
  • Thanks @JérômeRichard for your comment. Unfortunately I cannot provide a link to the file online (due to strict data privacy requirements), but I checked the formatting and I cannot see anything unusual. Also, using Omid's suggestion to append the files to a .csv file works so far (process still running but the input file that got the process killed last time went through this time). The job step that includes Omid's code got killed once after 10 hours (without any reason given in the .out file) but I restarted it for the remaining input files and it is still running now. – Markus Eyting Feb 14 '22 at 19:58
  • @MarkusEyting I am confused about the "700.000 kb" too, and about the "0.7 MB" now edited to 700 MB. If you try to load >1000 files of 700 MB *each*, then it is expected that your process gets killed, since you do not have enough RAM, as user3666197 said. You could perform some big-data computation (i.e. not in RAM), but this is a totally different problem. Can you clarify this point? – Jérôme Richard Feb 14 '22 at 20:36
  • Yes, sorry, it is 700,000 KB, so 700 MB. @user3666197 edited my post to 0.7 MB before, but that was wrong. Sorry for the confusion. The csv file is now roughly 150 GB after appending about 900 txt files (process still running), because from each .txt file I now only read two columns and append them to the csv file, using `usecols=["X", "Y"]`. The thing I wonder about is: if it is a RAM problem, why has the process continued to run for a few hours now? The csv file has only gotten bigger since the process got killed. Do you have any idea? – Markus Eyting Feb 14 '22 at 21:12
  • @MarkusEyting Ok. First of all, memory is allocated progressively, which is why the process does not crash immediately: each CSV takes some memory space and progressively fills the RAM as it is loaded. Moreover, the final `pd.concat` allocates additional space to copy the output into, so you need twice the space. Finally, computing machines like some PCs have swap memory. Swap memory is an additional amount of *virtual* memory that is stored on (much, much slower) storage devices. When there is not enough space, the OS becomes very slow and kills some processes (critical case). – Jérôme Richard Feb 14 '22 at 21:22
  • For completeness, this is the updated code snippet that loads the files (a cleaned-up, runnable version is sketched after this comment thread): `for filename in all_files:` `with open("./log.txt", "r+") as f:` `if not filename in f.read():` `df = pd.read_csv(filename, sep=None, header=0, engine="python", usecols=["X", "Y"])` `df.to_csv("./data.csv", header=None, index=None, mode="a")` `f.write(filename + "\n")` @JérômeRichard thanks! If each CSV progressively fills the RAM as it is loaded, can I not clear the RAM each time a CSV is loaded & appended, to make room for the next one? – Markus Eyting Feb 14 '22 at 21:35
  • @MarkusEyting You can only "clear the RAM" if your data are stored on a storage device (or computed, or even sent somewhere). But is this what you want: do you plan to write the concatenated `df`? If so, then the answer of OmidRoshani does the job (although there is probably a faster way to do that). Note that you should call `del df` after its loop so as to reduce the memory footprint. Note that this operation is slow because of file storage and CSV parsing. – Jérôme Richard Feb 14 '22 at 21:45
  • @JérômeRichard Yes, it is stored in a storage device. I do plan to write the concatenated `df` afterwards. Thanks a lot for your help. – Markus Eyting Feb 14 '22 at 21:49
  • @JérômeRichard - with all respect, Sir, your advice to re-use Omid Roshani's code is wrong. Yes, it did reduce the most horrific HPC anti-pattern (doubling the RAM footprint by keeping all individual df-s in RAM before and while the concat converts the list of df-s into a new super-dataframe), yet you forget that even a single RAM footprint (~150 GB) goes way beyond the physical RAM. Besides 10000x slower data-access times, the swapping blocks any and all other file-I/O operations and (for obvious reasons) any computing, via minutes-long blocked mem-I/O channels. Not good in HPC. – user3666197 Feb 15 '22 at 06:47
  • Plus, all that data curation should have been performed using HPC-grade tools, not a repetitive Python re-interpretation of a low-performance, low-efficiency loop that multiplies the sins of a few SLOC by millions, until it crashes or takes hundreds of times longer than technically needed. a) TXT-file merging is best done using powerful unix shell tools; b) column-wise operations should have been pre-computed, to avoid riding Python column-wise over the whole 150-GB row-wise "table" (another performance anti-pattern, solvable below by having a pair of .memmap files, one row-wise, the other column-wise - if needed). – user3666197 Feb 15 '22 at 06:56
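
For reference, below is a cleaned-up, runnable version of the snippet quoted in the comments above. The file names (./log.txt, ./data.csv), the column names "X" and "Y" and the `del df` call are taken from the discussion; the glob pattern and the use of mode "a+" (instead of "r+", so the log file need not exist on the first run) are added assumptions, not part of the original code.

import glob
import pandas as pd

# assumed location/pattern of the >1000 input .txt files (illustrative)
all_files = sorted(glob.glob('./*.txt'))

for filename in all_files:
    # the log file lets a killed job be restarted without re-reading finished files
    with open('./log.txt', 'a+') as f:
        f.seek(0)                          # 'a+' opens at end-of-file, so rewind before reading
        if filename not in f.read():
            # read only the two needed columns, letting the Python engine sniff the delimiter
            df = pd.read_csv(filename, sep=None, header=0,
                             engine='python', usecols=['X', 'Y'])
            # append this file's rows to one growing CSV on disk instead of keeping them in RAM
            df.to_csv('./data.csv', header=None, index=None, mode='a')
            f.write(filename + '\n')       # mark this file as done
            del df                         # release the DataFrame before the next file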

2 Answers


As you said, you have so many files and they need so much memory, so I suggest loading and saving all of the files into a single file in append mode (appending data to the previously saved data), like this:

import pandas as pd

# all_files: the list of input .txt file paths
for filename in all_files:
    df = pd.read_csv(filename, sep=None, header=0, engine='python')
    df.to_csv('./data.csv', header=None, index=None, mode='a')   # mode='a' appends to the one output file

After saving all of the files into a single file, you can read that single file as a dataframe like this:

df = pd.read_csv('./data.csv', header=None, index_col=False)

After that, if you have any memory issues when reading this file, you can use a chunked reader like this:

chunksize = 10 ** 6
with pd.read_csv('./data.csv', header=None, index_col=False, chunksize=chunksize) as reader:
    for chunk in reader:
        # do what you want with each chunk (a DataFrame of up to chunksize rows)
        ...
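
As one hedged illustration of what the chunk loop might do, the sketch below computes the mean of the first column across all chunks without ever holding the full file in RAM; the choice of column and the running-total approach are assumptions for the example (with header=None the columns are integer-labelled, so the first column is chunk[0]).

import pandas as pd

chunksize = 10 ** 6
total = 0.0
count = 0

with pd.read_csv('./data.csv', header=None, index_col=False, chunksize=chunksize) as reader:
    for chunk in reader:
        # accumulate per-chunk partial results instead of keeping the chunks around
        total += chunk[0].sum()       # column 0: first column of the header-less CSV
        count += chunk[0].count()     # count() ignores missing values

print('mean of first column:', total / count)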
Omid Roshani
  • With all due respect, the proposal does not solve the root cause of the use-case's poor performance. The file-I/O cost of having Python run a file-to-file "merge" of the data is too expensive. The segmented, chunk-based iterator for each subsequent re-reading of the supra-dataframe file is not good either - some ML-type work runs row-based (where chunks need not hurt us so much), yet other calculations (scalers, bias removal & others) run column-based, where row-based chunk iterators will perform poorly & are almost a wished-for HPC performance anti-pattern. Some other design is needed here. – user3666197 Feb 14 '22 at 10:58

Q : "How can I ... Any help would be much appreciated!"

A :
best follow the laws of the ECONOMY-of-COMPUTING :

Your briefly sketched problem has, without question, immense "setup" costs, with an unspecified amount of useful work to be computed on an unspecified HPC ecosystem.

Even without hardware & rental details (the devil is always hidden in the detail(s), and one can easily pay hilarious amounts of money trying to make a (hiddenly) "shared" platform deliver any improved computing performance - many startups have experienced this on voucher-sponsored promises, the more so if the overall computing strategy was poorly designed),

I cannot resist quoting the so-called 1st Etore's Law of Evolution of Systems' Dynamics :

If we open a can of worms,
the only way to put them back
is to use a bigger can

Closing our eyes so as not to see the accumulating inefficiencies is the worst sin of sins, as devastatingly exponential growth of all costs - time and resources, incl. energy consumption - is common in such complex systems of (often many levels of) stacked inefficiencies.


ELEMENTARY RULES-of-THUMB ... how much we pay in [TIME]

Sorry if these were already known to you beforehand; I am just trying to build some common ground as a platform on which to lay further argumentation rock-solid. More details are here, and this is only a necessary beginning, as more problems will definitely come from any real-world O( Mx * Ny * ... )-scaling related issues in further modelling.

                0.1 ns - CPU NOP - a DO-NOTHING instruction
                0.5 ns - CPU L1 dCACHE reference           (1st introduced in late 80-ies )
                1   ns - speed-of-light (a photon) travel a 1 ft (30.5cm) distance -- will stay, throughout any foreseeable future :o)
              3~4   ns - CPU L2  CACHE reference           (2020/Q1)
                7   ns - CPU L2  CACHE reference
               19   ns - CPU L3  CACHE reference           (2020/Q1 considered slow on 28c Skylake)
______________________on_CPU______________________________________________________________________________________
               71   ns - CPU cross-QPI/NUMA best  case on XEON E5-46*
              100   ns - own DDR MEMORY reference
              135   ns - CPU cross-QPI/NUMA best  case on XEON E7-*
              325   ns - CPU cross-QPI/NUMA worst case on XEON E5-46*
            2,500   ns - Read  10 kB sequentially from  MEMORY------ HPC-node
           25,000   ns - Read 100 kB sequentially from  MEMORY------ HPC-node
          250,000   ns - Read   1 MB sequentially from  MEMORY------ HPC-node
        2,500,000   ns - Read  10 MB sequentially from  MEMORY------ HPC-node
       25,000,000   ns - Read 100 MB sequentially from  MEMORY------ HPC-node (abstracted from shared physical RAM-I/O-channels)
      250,000,000   ns - Read   1 GB sequentially from  MEMORY------ HPC-node (abstracted from shared physical RAM-I/O-channels)
    2,500,000,000   ns - Read  10 GB sequentially from  MEMORY------ HPC-node (abstracted from shared physical RAM-I/O-channels)
   25,000,000,000   ns - Read 100 GB sequentially from  MEMORY------ HPC-node (abstracted from shared physical RAM-I/O-channels)
_____________________________________________________________________________own_CPU/DDR__________________________
 |   |   |   |   |
 |   |   |   | ns|
 |   |   | us|
 |   | ms|
 |  s|
h|
          500,000   ns - Round trip within a same DataCenter ------- HPC-node / HPC-storage latency on each access
       20,000,000   ns - Send   2 MB over 1 Gbps  NETWORK
      200,000,000   ns - Send  20 MB over 1 Gbps  NETWORK
    2,000,000,000   ns - Send 200 MB over 1 Gbps  NETWORK
   20,000,000,000   ns - Send   2 GB over 1 Gbps  NETWORK
  200,000,000,000   ns - Send  20 GB over 1 Gbps  NETWORK
2,000,000,000,000   ns - Send 200 GB over 1 Gbps  NETWORK
____________________________________________________________________________via_LAN_______________________________
      150,000,000   ns - Send a NETWORK packet CA -> Netherlands
____________________________________________________________________________via_WAN_______________________________
       10,000,000   ns - DISK seek spent to start file-I/O on spinning disks on any next piece of data seek/read
       30,000,000   ns - DISK   1 MB sequential READ from a DISK
      300,000,000   ns - DISK  10 MB sequential READ from a DISK
    3,000,000,000   ns - DISK 100 MB sequential READ from a DISK
   30,000,000,000   ns - DISK   1 GB sequential READ from a DISK
  300,000,000,000   ns - DISK  10 GB sequential READ from a DISK
3,000,000,000,000   ns - DISK 100 GB sequential READ from a DISK
______________________on_DISK_______________________________________________own_DISK______________________________
 |   |   |   |   |
 |   |   |   | ns|
 |   |   | us|
 |   | ms|
 |  s|
h|

Given these elements, the end-to-end computing strategy may and should be improved.


AS-WAS STATE
... where the crash prevented any computing at all

A naive figure says more than a thousand words

localhost          |
:       file-I/O ~ 25+GB SLOWEST/EXPENSIVE
:         1st time 25+GB file-I/O-s
:                  |
:                  | RAM       |
:                  |           |
+------+           |IOIOIOIOIOI|
|.CSV 0|           |IOIOIOIOIOI|
|+------+          |IOIOIOIOIOI|
||.CSV 1|          |IOIOIOIOIOI|
||+------+         |IOIOIOIOIOI|-> local ssh()-encrypt+encapsulate-process
|||.CSV 2|         |IOIOIOIOIOI|               25+GB of .CSV
+||      |         |IOIOIOIOIOI|~~~~~~~|
 ||      |         |IOIOIOIOIOI|~~~~~~~|
 +|      |         |           |~~~~~~~|
  |      |         |           |~~~~~~~|
  +------+         |           |~~~~~~~|
  ...              |           |~~~~~~~|-> LAN  SLOW
   ...             |                   |   WAN  SLOWER
    ...            |                   |   transfer of 30+GB to "HPC" ( ssh()-decryption & file-I/O storage-costs omited for clarity )
    +------+       |                   |               |                   30+GB           file-I/O ~ 25+GB SLOWEST/EXPENSIVE
    |.CSV 9|       |                   |~~~~~~~~~~~~~~~|                                     2nd time 25+GB file-I/O-s
    |+------+      |                   |~~~~~~~~~~~~~~~|
    ||.CSV 9|      |                   |~~~~~~~~~~~~~~~|
    ||+------+     |                   |~~~~~~~~~~~~~~~|
    |||.CSV 9|     |                   |~~~~~~~~~~~~~~~|
    +||     9|     |                   |~~~~~~~~~~~~~~~|
     ||     9|     |                   |~~~~~~~~~~~~~~~|
     +|      |     |                   |~~~~~~~~~~~~~~~|
      |      |     |                   |~~~~~~~~~~~~~~~|-> file-I/O into python
      +------+     |                                   |   all .CSV file to RAM ~ 25+GB SLOWEST/EXPENSIVE
                   |                                   |***|                        3rd time 25+GB file-I/O-s
                   |                                   |   RAM .CSV to df CPU work
                   |                                   |***|           df to LIST new RAM-allocation + list.append( df )-costs
                   |                                   |***|                    + 25+GB
                   |                                   |***|
many hours         |                                   |***|
  [SERIAL] flow ...|                                   |***|
/\/\/\/\/\/\/\/\/\/|\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\|***|/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/
                   |                       crashed     |***|
                   |                    on about       |***|
                   |                       360-th file |***|
                   |                                   |***|->RAM ~ 50~GB with a LIST of all 25+GB dataframes held in LIST
                   |                                       |  CPU +mem-I/O costs LIST to new 25+GB dataframe RAM-allocation & DATA-processing
                   |                                       |~~~~~| mem-I/O RAM|              :: GB
                   |                                       |~~~~~| mem-I/O    |RAM flow of ~ 50+GB over only 2/3/? mem-I/O HW-channels
                   |                                       |~~~~~|                                      only if "HPC"
                   |                                       |~~~~~|                                           is *NOT* a "shared"-rental of cloud HW,
                   |                                       |~~~~~|                                                    remarketed as an "HPC"-illusion
                   |                                       |~~~~~|
                   |                                             :::::::::::::?
                   |                                             :::::::::::::?
                   |                                             :::::::::::::?
                   |                                            <...some amount of some usefull work --"HPC"-processing the ~ 25+GB dataframe...>
                   |                                            <...some amount of some usefull work                                         ...>
                   |                                            <...some amount of some usefull work         the more                        ...>
                   |                                            <...some amount of some usefull work         the better                      ...>
                   |                                            <...some amount of some usefull work             as                          ...>
                   |                                            <...some amount of some usefull work             it                          ...>
                   |                                            <...some amount of some usefull work             dissolves to AWFULLY        ...>
                   |                                            <...some amount of some usefull work                          HIGH           ...>
                   |                                            <...some amount of some usefull work                          SETUP COSTS    ...>
                   |                                            <...some amount of some usefull work                                         ...>
                   |                                            <...some amount of some usefull work --"HPC"-processing the ~ 25+GB dataframe...>
                   |                                             :::::::::::::?
                   |                                             :::::::::::::?
                   |                                             :::::::::::::?
                   |                                                          |-> file-I/O ~ 25+GB SLOWEST/EXPENSIVE
                   |                                                          |~~~~~|          4th time 25+GB file-I/O-s
                   |                                                          |~~~~~|
                   |                                                          |~~~~~|->file left on remote storage (?)
                   |                                                                |
                   |                                                               O?R
                   |                                                                |
                   |                                                                |-> file-I/O ~ 25+GB SLOWEST/EXPENSIVE
                   |                                                                |~~~~~|          5th time 25+GB file-I/O-s
                   |                                                                |~~~~~|
                   |                                                                |~~~~~|
                   |                                                                |~~~~~|
                   |                                                                |~~~~~|-> RAM / CPU ssh()-encrypt+encapsulate-process
                   |                                                                      |????????????|      25+GB of results for repatriation
                   |                                                                      |????????????|                        on localhost
                   |                                                                      |????????????|
                   |                                                                      |????????????|
                   |                                                                      |????????????|-> LAN  SLOW
                   |                                                                                   |   WAN  SLOWER
                   |                                                                                   |   transfer of 30+GB from "HPC" ( ssh()-decryption & file-I/O storage-costs omited for clarity )
                   |                                                                                   |               |                     30+GB           file-I/O ~ 25+GB SLOWEST/EXPENSIVE
                   |                                                                                   |~~~~~~~~~~~~~~~|                                       6th time 25+GB file-I/O-s
                   |                                                                                   |~~~~~~~~~~~~~~~|
                   |                                                                                   |~~~~~~~~~~~~~~~|
                   |                                                                                   |~~~~~~~~~~~~~~~|
                   |                                                                                   |~~~~~~~~~~~~~~~|
 SUCCESS ?         |                                                                                   |~~~~~~~~~~~~~~~|-> file transferred back and stored on localhost storage
    after          |
          how many |
          failed   |
          attempts |
   having          |
          how high |
          recurring|
          costs    |
          for any  |
              next |
              model|
          recompute|
          step(s)  |
                   |
                   |
All                |
that               |
( at what overall  |
    [TIME]-domain  |
   & "HPC"-rental  |
           costs ) |

Tips :

  • review and reduce, where possible, expensive data-item representations (avoid using int64 where 8 bits are enough; packed bitmaps can help a lot) - a minimal sketch follows this list
  • precompute on localhost all items that can be precomputed (avoiding repetitive steps)
  • join the "reduced" CSV files into a single input using a trivial O/S command
  • compress all data before transports (a few orders of magnitude saved)
  • prefer to formulate your computing with algorithms that can stream-process items along the data-flow, i.e. not waiting to load everything in-RAM before computing an average or similar trivially stream-computable values (like .mean(), .sum(), .min(), .max() or even .rsi(), .EMA(), .TEMA(), .BollingerBands() ... and many more alike) - stream-computing formulations reduce the RAM allocations, can & should be pre-computed (once), and minimise the [SERIAL] one-after-another processing-pipeline latency
  • if you indeed need to use pandas and are fighting overall physical-RAM ceilings, you may try smart numpy-tools instead, where all the array syntax & methods remain the same, yet they can, by design, work without moving all data at once from disk into physical RAM (using this has been my life-saving trick ever since, the more so when running many-model simulations & HyperParameterSPACE optimisations on a few tens of GB of data on 32-bit hardware)
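
As a minimal sketch of the first tip above: force compact dtypes while reading. The file name, the delimiter and the three-column schema below are purely illustrative assumptions (a hypothetical, already "reduced" file) - substitute whatever the real files contain.

import pandas as pd

# illustrative schema: in reality all 15 columns would be listed
compact_dtypes = {
    'flag':  'int8',      # a 0/1 indicator does not need 64 bits
    'count': 'int32',
    'value': 'float32',   # if float64 precision is not actually required
}

df = pd.read_csv('one_reduced_file.txt',
                 sep='\t',                       # assumed delimiter
                 header=None,
                 names=['flag', 'count', 'value'],
                 dtype=compact_dtypes)

print(df.memory_usage(deep=True))                # compare against the all-float64 default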

For more details on going in the direction of RAM-protecting, memory-mapped np.ndarray processing, with all the smart numpy-vectorised and other high-performance-tuned tricks, read this :

>> print( np.memmap.__doc__ )
Create a memory-map to an array stored in a *binary* file on disk.

   Memory-mapped files are used for accessing small segments of large files
   on disk, without reading the entire file into memory.  NumPy's
   memmap's are array-like objects. (...)
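
A minimal usage sketch follows; the shape, dtype and file name are assumptions chosen only to show the pattern. The merged numeric data is written once into a binary file and can then be reopened, sliced and processed without pulling the whole array into physical RAM.

import numpy as np

rows, cols = 1_000_000, 15                      # assumed shape of the merged table

# create the disk-backed array once (this allocates the file, not RAM)
mm = np.memmap('merged.dat', dtype='float32', mode='w+', shape=(rows, cols))
mm[:10, :] = 42.0                               # writes go to the mapped file
mm.flush()                                      # make sure the data hits the disk
del mm

# later: reopen read-only; only the pages actually touched get loaded
mm = np.memmap('merged.dat', dtype='float32', mode='r', shape=(rows, cols))
print(mm[:5, :3])                               # a tiny slice, read lazily
print(mm[:, 0].mean())                          # column-wise work streams through the map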
user3666197