
I am writing Python 3 code that has to open about 550 files in a directory, read their contents, and append them to a string variable 'all_text', which will end up being millions of lines long.

The inefficient code I was using until now is:

all_text += str(file_content)

But then I read that the 'join()' method is more efficient, so I tried the following:

all_text = ''.join(file_content)

The problem with this code is that it discards the previously held contents of the 'all_text' variable and keeps only the current file's content!

How do I get around this problem?

Thanks for your help!

Arun

1 Answer


join() has the signature str.join(iterable), where the iterable can be a generator, a list, a set, and so on. So it is helpful when you already have a collection of strings read from the files and you concatenate them all with a single join. For example:

numList = ['1', '2', '3', '4']
separator = ', '
print(separator.join(numList))    # 1, 2, 3, 4

numTuple = ('1', '2', '3', '4')
print(separator.join(numTuple))   # 1, 2, 3, 4
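
Note that join() only accepts strings in the iterable; anything else has to be converted first, e.g.:

nums = [1, 2, 3, 4]
# join() raises a TypeError for non-string items, so convert them first
print(', '.join(str(n) for n in nums))   # 1, 2, 3, 4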

s1 = 'abc'
s2 = '123'

# s1 acts as the separator placed between the characters of s2
print('s1.join(s2):', s1.join(s2))   # s1.join(s2): 1abc2abc3

# s2 acts as the separator placed between the characters of s1
print('s2.join(s1):', s2.join(s1))   # s2.join(s1): a123b123c

You can get all the lines of an open file object f as one string with ''.join(f.readlines()) (which gives the same result as f.read()).
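
For instance ('example.txt' below is just a throwaway file created for the demonstration):

# write a small file, then rebuild its full text with join
with open('example.txt', 'w', encoding='utf-8') as f:
    f.write('first line\nsecond line\n')

with open('example.txt', encoding='utf-8') as f:
    text = ''.join(f.readlines())   # readlines() keeps the newlines

print(repr(text))   # 'first line\nsecond line\n'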

Now you can accomplish your task with join and the fileinput module as follows:

import fileinput
files = ['package-lock.json', 'sqldump.sql', 'felony.json', 'maindata.csv']
# fileinput.input() yields the lines of every listed file in turn,
# so a single join produces the combined text
allfiles = fileinput.input(files)
all_text = ''.join(allfiles)
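
Since your files live in a directory, you don't have to hard-code the names; one way (assuming, say, a directory called 'data' holding the ~550 files) is to build the list with glob:

import glob
import fileinput

# assumption: the ~550 files live in a directory called 'data'
files = sorted(glob.glob('data/*'))
all_text = ''.join(fileinput.input(files))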

Refer to this answer for the most efficient way to concatenate files into a string.

Suggestion: As you mentioned there will be millions of lines, have you considered how much memory it will take to store all of that in one variable? It is better to do whatever you are planning to do with the text on the fly, while reading the lines, instead of storing it all in a variable first.
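
For example, if the result is going to be scanned or written out line by line anyway, you can process each line as it is read instead of building one huge string (writing to 'combined.txt' below is only a stand-in for whatever processing you actually need):

import fileinput

files = ['package-lock.json', 'sqldump.sql', 'felony.json', 'maindata.csv']

# handle each line as it arrives instead of keeping everything in memory
with open('combined.txt', 'w', encoding='utf-8') as out:
    for line in fileinput.input(files):
        out.write(line)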

  • I tried your code using the 'fileinput' module but it gives me a MemoryError. My system has 8GB RAM and the combined size of the 550 files in the directory is around 2.3 GB. Any ideas how to avoid this? – Arun Nov 10 '18 at 14:56
  • It could be that your machine doesn't have enough free RAM, or that your OS has limited Python to a certain amount of RAM. Nevertheless, as I suggested, you are better off doing whatever you plan to do with those files on the fly rather than adding them all up: the open files consume memory, and the combined string consumes memory again, so you end up using a lot of memory, which is inefficient in the first place. – Mani Kumar Reddy Kancharla Nov 12 '18 at 09:45