I'm working with a large number of files (~8000): I read each one into a dictionary, manipulate and analyse the values, convert the result to a pandas dataframe, then output it to a csv.
This question is my attempt to solve my issue here: Tips for working with large quantity .txt files (and overall large size) - python?
The code works fine for the first ~500 files or so, but crashes my computer/Python when I run it on the full sample.
My code structure looks like this:
import pandas as pd
from collections import Counter

# For-loop 1: read each file into a dictionary
for file in filenames:
    # do stuff
    with open(file) as f:
        # do more stuff

# For-loop 2: count keywords per file
dict3 = {}
for k, v in dict2.items():
    # do stuff
    dict3[k] = dict(Counter(new))

# convert dictionary to dataframe using pandas
df = pd.DataFrame.from_dict(dict3, orient='index').fillna(0).astype(int)

# export dataframe to csv
df.to_csv(r'path\example.csv', index=True, header=True)
My question is this:
If I break the first for-loop after the first 500 files, e.g. by enumerating the filenames:

for i, file in enumerate(filenames):
    if i == 500:
        break
Is there a way to adjust the code so that, after it has run through the script, it returns to the first for-loop and iterates over files 501-1000, and so on until I've cycled through all 8000 files?
Additionally, I'd want the csv output to be appended to from its last row, so that each new batch of files adds new rows instead of over-writing the file entirely.
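Something like this is what I have in mind, if it's the right direction (process_files here is just a placeholder for for-loops 1 and 2 above, and the chunk size of 500 is only what my machine seems to handle):

import os
import pandas as pd

CHUNK = 500
out_path = r'path\example.csv'

for start in range(0, len(filenames), CHUNK):
    # run for-loops 1 and 2 on this slice of files only
    dict3 = process_files(filenames[start:start + CHUNK])
    df = pd.DataFrame.from_dict(dict3, orient='index').fillna(0).astype(int)
    # keep the columns in a fixed order so appended batches line up
    df = df.reindex(columns=filter_Words, fill_value=0)
    # append to the csv; only write the header for the first batch
    df.to_csv(out_path, mode='a', index=True, header=not os.path.exists(out_path))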
If my solution seems jagged, I'd love to get some feedback on where to take this, as I'm still very new to Python.
Thanks!
Edit: Elaborating on what I'm trying to do with my data
Goal: I have thousands of .txt files that I want to count keywords in, and output these counts into a csv.
This is my process:
open and read the .txt files, and store them into a dictionary as such:

dict1 = {'file1': 'string for all contents in file', 'file2': 'string for all contents in file', ... , 'file_last': 'string for all contents in file'}
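Concretely, the reading step is roughly this (the utf-8 encoding is an assumption; my real code may pass something else):

dict1 = {}
for file in filenames:
    with open(file, encoding='utf-8') as f:
        # one string per file, keyed by filename
        dict1[file] = f.read()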
Now I want to convert all the values of this dict to lower-case. I use a user-defined function called lower_dict to get dict2 = lower_dict(dict1).
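For reference, lower_dict is roughly:

def lower_dict(d):
    # lower-case every value so keyword matching is case-insensitive
    return {k: v.lower() for k, v in d.items()}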
Now I define a list with the words I want to count in my dict2:

filter_Words = ["word1", "word2", ... , "word_last"]

Then, looping over dict2 with for k, v in dict2.items(), I count the occurrence of each word in each file and store the counts in a new dict, dict3:

dict3 = {'file1': {'word1': 5, 'word2': 3}, 'file2': {'word1': 12, 'word2': 0}}
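The counting step looks roughly like this (splitting on whitespace is a simplification of my actual tokenising):

from collections import Counter

dict3 = {}
for k, v in dict2.items():
    # keep only the words I'm counting
    new = [word for word in v.split() if word in filter_Words]
    dict3[k] = dict(Counter(new))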
I export this to a pandas dataframe, and then export the dataframe to a csv: the rows are the filenames, the columns are word1, word2, ..., and the entries are the number of times each word appears in each file.