
I'm working on a script to parse .txt files and store their contents in a pandas DataFrame that I can export to a CSV.

My script worked fine when I was using fewer than 100 of my files, but now that I'm trying to run it on the full sample, I'm running into a lot of issues.

I'm dealing with ~8000 .txt files with an average size of 300 KB, so about 2.5 GB in total.

I was wondering if I could get tips on how to make my code more efficient.

For opening and reading the files, I use:

import os

filenames = os.listdir('.')
dict = {}
for file in filenames:
    with open(file) as f:
        contents = f.read()
        # key = filename without the ".txt" extension, value = full file contents
        dict[file.replace(".txt", "")] = contents

Doing print(dict) seems to crash (or at least hang) my Python session. Is there a better way to handle this?
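
(For illustration, one way to peek at just a few entries instead of printing the whole dictionary - the choice of itertools.islice and the count of 5 are arbitrary, not part of the original script:)

from itertools import islice

# print only the first few key/value pairs rather than the entire dict
for name, contents in islice(dict.items(), 5):
    print(name, len(contents), "characters")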

Additionally, I convert all the values in my dict to lowercase using:

def lower_dict(d):
    # use a dict comprehension so the shadowed built-in name `dict` isn't called
    return {k: v.lower() for k, v in d.items()}

lower = lower_dict(dict)

I haven't tried this yet (I can't get past the opening/reading stage), but I was wondering whether this would cause problems.

Now, before I am marked as a duplicate, I did read this: How can I read large text files in Python, line by line, without loading it into memory?

However, that user seemed to be working with one very large 5 GB file, whereas I am working with many small files totalling 2.5 GB (and my ENTIRE sample is actually something like 50 GB across 60,000 files). So I was wondering whether my approach would need to be different. Sorry if this is a dumb question; unfortunately, I am not well versed in RAM and memory management.
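
(One possible shape for the full 60,000-file sample, assuming the goal is to avoid holding everything in memory at once: process each file on its own and append a row of results as you go. The process_file helper and the output filename below are placeholders, not part of the original script:)

import csv
import os

def process_file(path):
    # placeholder: read one file, lowercase it, and return whatever
    # per-file result is actually needed (e.g. word counts)
    with open(path) as f:
        text = f.read().lower()
    return {"filename": path.replace(".txt", ""), "n_chars": len(text)}

with open("results.csv", "w", newline="") as out:
    writer = csv.DictWriter(out, fieldnames=["filename", "n_chars"])
    writer.writeheader()
    for name in os.listdir('.'):
        if name.endswith(".txt"):
            writer.writerow(process_file(name))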

Any help is very much appreciated.

thanks

HonsTh
  • How do you change the format of the txt files before saving them to csv? Is there any reason for keeping the txt contents in a dictionary? – crayxt Sep 05 '19 at 03:25
  • For the text files, I count specific words and add them to my dictionary as separate key-value pairs: keys = word1, word2, ..., values = the number of times each word appears. Once my dictionary is populated, I save it as a DataFrame in pandas and export to CSV (roughly the flow sketched below) - is this inefficient? Should I be working entirely out of a DataFrame? – HonsTh Sep 05 '19 at 03:28
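
(For reference, the workflow described in this comment might look roughly like the following, reusing the lower dict from the question; TARGET_WORDS is a placeholder, not the actual word list:)

import pandas as pd

TARGET_WORDS = ["word1", "word2"]  # placeholder list of words to count

# total occurrences of each target word across all (lowercased) file contents
word_counts = {w: 0 for w in TARGET_WORDS}
for contents in lower.values():
    tokens = contents.split()
    for w in TARGET_WORDS:
        word_counts[w] += tokens.count(w)

df = pd.DataFrame(list(word_counts.items()), columns=["word", "count"])
df.to_csv("word_counts.csv", index=False)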

1 Answer


I believe the thing slowing your code down the most is the .replace() method you are using. I believe this is because the built-in replace method is iterative and, as a result, very inefficient. Try using the re module in your for loops. Here is an example of how I used the module recently to replace the characters "T", ":" and "-" with "", which in this case removed them from the file:

import re

for line in lines:
    line = re.sub('[T:-]', '', line)  # strip 'T', ':' and '-' characters
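
In your case, a rough sketch of how re.sub could slot into the loop from your question (the variable names here are just placeholders):

import os
import re

file_contents = {}
for file in os.listdir('.'):
    with open(file) as f:
        # strip the ".txt" extension with an anchored regex instead of str.replace
        file_contents[re.sub(r'\.txt$', '', file)] = f.read()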

Let me know if this helps!

Shaun Lowis
  • Shaun - I just got around to trying your suggestion and can confirm your suspicion was correct - this was the problem! Thank you so much for that; you saved me a lot of time! I thought I would have to do a ground-up rebuild of my code. – HonsTh Sep 05 '19 at 20:28