
I'm working with a huge number of files (~8000): storing them in a dictionary, manipulating and analysing the values, converting to a pandas dataframe, then outputting to a csv.

This question is my attempt to solve the issue I described here: Tips for working with large quantity .txt files (and overall large size) - python?

The code is fine for the first ~500 files or so, but crashes my computer/python when I use my full sample.

My code structure looks like this:

from collections import Counter
import pandas as pd

# For-loop 1
for file in filenames:
    # do stuff
    with open(file) as f:
        # do more stuff

# For-loop 2
for k, v in dict2.items():
    # do stuff
    dict3[k] = dict(Counter(new))

# convert dictionary to dataframe using pandas.
df = pd.DataFrame.from_dict(dict3, orient='index').fillna(0).astype(int)

# export dataframe to csv.
df.to_csv(r'path\example.csv', index=True, header=True)

My question is this:

If I break the first for-loop after the first 500 files, e.g. with enumerate:

for i, file in enumerate(filenames):
    if i == 500:
        break

Is there a way to adjust the code so that after it's run through the script, it returns to the first for-loop and iterates over files 501-1000, and so on until I've cycled through all 8000 files?

Additionally, I'd want the csv output to be appended from the last row with each new batch of files, instead of being overwritten entirely.
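
For example, here is a minimal sketch of what I'm imagining, where process_batch is a made-up placeholder for the dictionary/Counter work above:

import pandas as pd

BATCH_SIZE = 500

for start in range(0, len(filenames), BATCH_SIZE):
    batch = filenames[start:start + BATCH_SIZE]

    # process_batch stands in for the per-file reading, lower-casing
    # and counting above; it returns {filename: {word: count, ...}, ...}
    dict3 = process_batch(batch)

    df = pd.DataFrame.from_dict(dict3, orient='index').fillna(0).astype(int)

    # Reindex so every batch has identical columns and appended rows line up
    df = df.reindex(columns=filter_Words, fill_value=0)

    # mode='a' appends to the csv instead of overwriting; header only once
    df.to_csv(r'path\example.csv', mode='a', index=True,
              header=(start == 0))

Would something like this be the right approach?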

If my solution seems jagged, I'd love to get some feedback on where to take this, as I'm still very new to python.

Thanks!

Edit: Elaborating on what I'm trying to do with my data

Goal: I have thousands of .txt files in which I want to count keywords, then output these counts into a csv.

This is my process (a minimal sketch of the full pipeline follows the list):

  1. Open and read the .txt files, and store their contents in a dictionary, like so: dict1 = {'file1': 'string for all contents in file', 'file2': 'string for all contents in file', ... 'file_last': 'string for all contents in file'}

  2. Convert all the values of this dict to lower-case. I use a user-defined function called lower_dict to get dict2 = lower_dict(dict1).

  3. Define a list of the words I want to count in dict2: filter_Words = ["word1", "word2", ..., "word_last"]

  4. for k, v in dict2.items(): I count the occurrence of each word in each file, and store the counts in a new dict, dict3:

dict3 = {'file1': {'word1': 5, 'word2': 3}, 'file2': {'word1': 12, 'word2': 0}}

  5. I export dict3 to a pandas dataframe.

  6. I export the dataframe to a csv: rows are filenames, columns are word1, word2, ..., and entries are the number of times each word appears in each file.
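
Putting these steps together, this is roughly the whole pipeline (a minimal sketch; my real lower_dict and counting code may differ in the details):

from collections import Counter
import pandas as pd

def lower_dict(d):
    # Step 2: lower-case every file's contents
    return {k: v.lower() for k, v in d.items()}

# Step 1: read every file into dict1
dict1 = {}
for file in filenames:
    with open(file) as f:
        dict1[file] = f.read()

dict2 = lower_dict(dict1)

# Steps 3-4: keep counts only for the words in filter_Words
filter_Words = ["word1", "word2", "word_last"]
dict3 = {}
for k, v in dict2.items():
    dict3[k] = dict(Counter(w for w in v.split() if w in filter_Words))

# Steps 5-6: convert to a dataframe and write the csv
df = pd.DataFrame.from_dict(dict3, orient='index').fillna(0).astype(int)
df.to_csv(r'path\example.csv', index=True, header=True)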

  • use `continue`, https://docs.python.org/3/tutorial/controlflow.html#break-and-continue-statements-and-else-clauses-on-loops – Manualmsdos Sep 05 '19 at 06:38

1 Answer


I don't know that it's necessary for you to store your whole filebase as a dictionary. Reading through some of your various posts, it sounds like you have 50 GB of files you're iterating through.
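
Without knowing your exact operations, the general pattern would be to process one file at a time, so you never hold more than one file's text in memory. A rough sketch, assuming a per-file keyword count (filenames and filter_words are hypothetical names for your list of paths and your keyword list):

from collections import Counter
import pandas as pd

filter_set = set(filter_words)  # set membership checks are O(1)

counts = {}
for file in filenames:
    with open(file) as f:
        words = f.read().lower().split()
    # Keep only the small per-file count dict; the raw text goes
    # out of scope as soon as the next file is read
    counts[file] = {w: c for w, c in Counter(words).items() if w in filter_set}

df = pd.DataFrame.from_dict(counts, orient='index').fillna(0).astype(int)
df.to_csv(r'path\example.csv', index=True)

That keeps memory proportional to the counts rather than to the raw text.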

Perhaps this answer will lead you in the right direction: "Large data" workflows using pandas

I think the solution will really boil down to precisely what you are trying to do with your data. So perhaps you could outline in your question exactly which operations you're performing; that will probably be necessary before anyone can make a tailored recommendation for your dataset.

  • Thanks for the suggestion, I'll have a read of that link. I've added a more precise summary in my edit in terms of what my script does. – HonsTh Sep 05 '19 at 07:31