
I'm working on a script to parse .txt files and store their contents in a pandas DataFrame that I can export to a CSV.

My script worked fine when I was using fewer than 100 of my files, but now that I'm trying to run it on the full sample, I'm running into a lot of issues.

I'm dealing with ~8000 .txt files with an average size of 300 KB, so about 2.5 GB in total.

I was wondering if I could get tips on how to make my code more efficient.

For opening and reading the files, I use:

import os

filenames = os.listdir('.')
dict = {}
for file in filenames:
    with open(file) as f:
        contents = f.read()
        # key = filename without the ".txt" extension, value = full file contents
        dict[file.replace(".txt", "")] = contents

Doing print(dict) seems to crash (or at least hang) my Python session. Is there a better way to handle this?
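
(For illustration, one way to peek at just a few entries instead of printing the whole dictionary - the choice of itertools.islice and the count of 5 are arbitrary, not part of the original script:)

from itertools import islice

# print only the first few key/value pairs rather than the entire dict
for name, contents in islice(dict.items(), 5):
    print(name, len(contents), "characters")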

Additionally, I convert all the values in my dict to lowercase using:

def lower_dict(d):
    # use a dict comprehension so the shadowed built-in name `dict` isn't called
    return {k: v.lower() for k, v in d.items()}

lower = lower_dict(dict)

I haven't tried this yet (I can't get past the opening/reading stage), but I was wondering whether this would cause problems.

Now, before I am marked as a duplicate, I did read this: How can I read large text files in Python, line by line, without loading it into memory?

However, that user seemed to be working with one very large 5 GB file, whereas I am working with many small files totalling 2.5 GB (and my ENTIRE sample is actually something like 50 GB across 60,000 files). So I was wondering whether my approach would need to be different. Sorry if this is a dumb question; unfortunately, I am not well versed in RAM and memory management.
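
(One possible shape for the full 60,000-file sample, assuming the goal is to avoid holding everything in memory at once: process each file on its own and append a row of results as you go. The process_file helper and the output filename below are placeholders, not part of the original script:)

import csv
import os

def process_file(path):
    # placeholder: read one file, lowercase it, and return whatever
    # per-file result is actually needed (e.g. word counts)
    with open(path) as f:
        text = f.read().lower()
    return {"filename": path.replace(".txt", ""), "n_chars": len(text)}

with open("results.csv", "w", newline="") as out:
    writer = csv.DictWriter(out, fieldnames=["filename", "n_chars"])
    writer.writeheader()
    for name in os.listdir('.'):
        if name.endswith(".txt"):
            writer.writerow(process_file(name))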

Any help is very much appreciated.

thanks

HonsTh
  • How do you change the format of the txt files before saving them to csv? Is there any reason for keeping the txt contents in a dictionary? – crayxt Sep 05 '19 at 03:25
  • For the text files, I count specific words and add them to my dictionary as separate key-value pairs: keys = word1, word2, ..., values = the number of times each word appears. Once my dictionary is populated, I save it as a DataFrame in pandas and export to CSV (roughly the flow sketched below) - is this inefficient? Should I be working entirely out of a DataFrame? – HonsTh Sep 05 '19 at 03:28
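
(For reference, the workflow described in this comment might look roughly like the following, reusing the lower dict from the question; TARGET_WORDS is a placeholder, not the actual word list:)

import pandas as pd

TARGET_WORDS = ["word1", "word2"]  # placeholder list of words to count

# total occurrences of each target word across all (lowercased) file contents
word_counts = {w: 0 for w in TARGET_WORDS}
for contents in lower.values():
    tokens = contents.split()
    for w in TARGET_WORDS:
        word_counts[w] += tokens.count(w)

df = pd.DataFrame(list(word_counts.items()), columns=["word", "count"])
df.to_csv("word_counts.csv", index=False)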

1 Answer


I believe the thing slowing your code down the most is the .replace() method you are using. I believe this is because the built-in replace method is iterative and, as a result, very inefficient. Try using the re module in your for loops. Here is an example of how I used the module recently to replace the characters "T", ":" and "-" with "", which in this case removed them from the file:

import re

for line in lines:
    line = re.sub('[T:-]', '', line)  # strip 'T', ':' and '-' characters
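
In your case, a rough sketch of how re.sub could slot into the loop from your question (the variable names here are just placeholders):

import os
import re

file_contents = {}
for file in os.listdir('.'):
    with open(file) as f:
        # strip the ".txt" extension with an anchored regex instead of str.replace
        file_contents[re.sub(r'\.txt$', '', file)] = f.read()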

Let me know if this helps!

Shaun Lowis
  • Shaun - I just got around to trying your suggestion and can confirm your suspicion was correct - this was the problem! Thank you so much for that; you saved me a lot of time! I thought I would have to do a ground-up rebuild of my code. – HonsTh Sep 05 '19 at 20:28