
I have 120 txt files, each around 150 MB in size and with thousands of columns; overall there are definitely more than 1 million columns. When I try to concatenate them using pandas I get this error: "Unable to allocate 36.4 MiB for an array with shape (57, 83626) and data type object". I've tried Jupyter Notebook and Spyder; neither works.

How can I join the data? Or is this data not suitable for pandas?

Thanks!

Alex B
    Looks like you're running out of system memory. Without knowing more about why you need to load this data, I have no basis to determine whether pandas is the correct tool – inspectorG4dget Sep 30 '20 at 14:25
  • Maybe adjusting the Jupyter Notebook default memory limit can help you; check out this [answer](https://stackoverflow.com/questions/57948003/how-to-increase-jupyter-notebook-memory-limit) (a rough sketch of that config change follows below) – Onur Guven Sep 30 '20 at 14:32
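
A rough sketch of the config change that linked answer describes, assuming the `NotebookApp.max_buffer_size` setting and the default config location (the 10 GB figure is illustrative; raising the limit only helps if the machine actually has that much RAM free):

# In ~/.jupyter/jupyter_notebook_config.py (generate it with `jupyter notebook --generate-config`)
c.NotebookApp.max_buffer_size = 10 * 1024 * 1024 * 1024  # ~10 GB, adjust to your available memory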

2 Answers


You are running out of memory. Even if you manage to load all of the files (with pandas or another package), your system will still run out of memory for every task you then try to perform on this data.

Assuming that you want to perform different operations on different columns of all the tables, the best approach is to perform each task separately, preferably batching your columns, since, as you say, there are thousands of them per file.

Let's say you want to sum the values in the first column of each file (assuming they are numbers...) and store these results in a list:

import glob
import pandas as pd
import numpy as np

filelist = glob.glob('*.txt') # Make sure you're working in the directory containing the files

sum_first_columns = []

for file in filelist:
    df = pd.read_csv(file, sep=' ')  # Adjust the separator for your case
    sum_temp = np.sum(df.iloc[:, 0])  # Sum of this file's first column
    sum_first_columns.append(sum_temp)

You now have a list of 120 sums, one per file.
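
The same idea extends to the column batching mentioned above. A rough sketch, assuming whitespace-separated files with a header row (the batch size of 1000 is arbitrary): read only the header first to learn the column names, then load a limited slice of columns per pass with `usecols`, so only that batch is ever in memory.

import glob
import pandas as pd

BATCH = 1000  # arbitrary batch size, tune it to your available memory

for file in glob.glob('*.txt'):
    # Read zero rows just to get the column names without loading any data
    all_cols = pd.read_csv(file, sep=' ', nrows=0).columns
    for start in range(0, len(all_cols), BATCH):
        cols = all_cols[start:start + BATCH]
        batch_df = pd.read_csv(file, sep=' ', usecols=cols)  # Only this slice of columns is loaded
        # ... perform your per-column operation on batch_df here ...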

This is how I would handle each operation if I had to work on my own computer/system.

Please note that this process will also be very time-consuming, given the size of your files. You can either try to reduce your data or use a cloud server to do the computation.
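
On the "reduce your data" side, one easy win (assuming the values really are numeric) is to stop pandas from storing them as object/float64. A minimal sketch with a hypothetical file name:

import pandas as pd

# float32 uses half the memory of the default float64; this raises an error
# if a column turns out to be non-numeric, in which case list the numeric
# columns explicitly in a dtype mapping instead.
df = pd.read_csv('example.txt', sep=' ', dtype='float32')
print(df.memory_usage(deep=True).sum() / 1024 ** 2, 'MiB')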

8783645
  • How can I use a cloud server? Is it possible to use SharePoint rather than my laptop memory? – Alex B Sep 30 '20 at 15:19
  • @AlexB, you have to pay for a server to run your processes. AWS is a common choice. I'm not familiar with SharePoint, so I cannot help with that. – 8783645 Sep 30 '20 at 15:21

Saying you want to concat in pandas implies that you just want to merge all 120 files together into one file? If so, you can iterate through all the files in a directory, read each one in as a list of tuples (or something like that), and combine them all into one list. Lists and tuples take orders of magnitude less memory than DataFrames, but you won't be able to perform calculations on them unless you put them into a NumPy array or a DataFrame.
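
A rough sketch of that approach, assuming whitespace-separated files that all share the same header (the values stay as strings until you convert them):

import csv
import glob

rows = []  # one tuple per data row, across all files
for path in glob.glob('*.txt'):
    with open(path, newline='') as f:
        reader = csv.reader(f, delimiter=' ')
        next(reader)  # skip each file's header line
        rows.extend(tuple(r) for r in reader)

print(len(rows), 'rows combined')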

At a certain point, when there is too much data, it is appropriate to shift from pandas to Spark, since Spark can use the compute power and memory of a cluster instead of being restricted to your local machine's or server's resources.
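
If you do go the Spark route, a minimal PySpark sketch (assuming `pyspark` is installed and the files are whitespace-separated with headers) looks roughly like this:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName('combine-txt-files').getOrCreate()

# Spark reads every matching file and unions them into one distributed
# DataFrame instead of pulling everything into local RAM at once.
df = spark.read.csv('*.txt', sep=' ', header=True)
print(df.count())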

juppys