I am trying to read data from several text files in my Google Drive into a Google Colab notebook with the following Python code.
import os
import glob
import pandas as pd
# Load the Drive helper and mount
from google.colab import drive
# This will prompt for authorization.
drive.mount('/content/drive')
os.chdir("/content/drive/MyDrive/AMI_2000_customers")
extension = 'txt'
all_filenames = pd.Series([i for i in glob.glob('*.{}'.format(extension))])
searchfor = ['2020', '2021']
result = list(all_filenames[all_filenames.str.contains('|'.join(searchfor))])
After that, I try to combine them by running the code below. Each raw data file contains one month of customer data, so preserving the continuity of the time series matters for the data preparation step that follows.
data = pd.concat([pd.read_csv(f, sep='\t', header=None) for f in result])
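Since pd.concat materialises every file in memory at once, I also check the in-memory footprint of the combined frame and downcast the numeric columns. The sketch below is only an illustration and assumes the columns are mostly numeric readings stored in the 64-bit types pandas uses by default; it operates on the data frame built above.

# Report the in-memory size of the combined frame (assumes mostly numeric columns).
size_gb = data.memory_usage(deep=True).sum() / 1e9
print(f"combined frame uses about {size_gb:.1f} GB in RAM")

# Downcast 64-bit numeric columns to the smallest type that fits the values.
for col in data.select_dtypes(include='float64').columns:
    data[col] = pd.to_numeric(data[col], downcast='float')
for col in data.select_dtypes(include='int64').columns:
    data[col] = pd.to_numeric(data[col], downcast='integer')

print(f"after downcasting: {data.memory_usage(deep=True).sum() / 1e9:.1f} GB")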
The raw data files that match the searchfor condition number about 24 and total roughly 11.7 GB; they look like this in my Google Drive directory.
When I execute the program above, I run into high RAM consumption (it almost reaches the maximum available RAM), and I do not have enough RAM left for the next processing steps in Google Colab. (I subscribe to Colab Pro and can use the Python 3 Google Compute Engine backend with both GPU and TPU, which provides up to 35 GB of memory.)
Is there an appropriate way to complete this task with reasonable RAM usage and computation time, so that I do not run out of available RAM?
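For reference, one direction that might keep RAM bounded is to stream each file through pandas in fixed-size chunks and append the chunks to a single combined file on Drive, so that only one chunk is held in memory at a time. This is only a sketch of the idea, not something I have verified on the full 11.7 GB; the output path and chunk size are placeholders, and it assumes the filenames sort chronologically.

import pandas as pd

out_path = '/content/drive/MyDrive/AMI_2000_customers/combined.txt'  # placeholder path
chunk_rows = 1_000_000                                               # placeholder chunk size

first = True
for f in sorted(result):  # sorted so the time series stays in chronological order
    for chunk in pd.read_csv(f, sep='\t', header=None, chunksize=chunk_rows):
        # Append each chunk to one file on disk; only this chunk is held in RAM.
        chunk.to_csv(out_path, sep='\t', header=False, index=False,
                     mode='w' if first else 'a')
        first = False

The later steps could then read the combined file back in chunks as well, or with explicit dtype and usecols arguments, instead of loading everything at once. Is this a sensible approach, or is there a better one?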