
I need to write code that reads all the text files in the subfolders and counts the frequency of some words in each text file. Below is my code:

import fnmatch
import os
from os import walk
from os.path import join

import pandas as pd

def process_files(dir):
    # List of filenames in the directory
    filenames = []
    for root, _, files in walk(dir):
        filenames.extend([join(root, file) for file in fnmatch.filter(files, '*.txt')])

    df = pd.DataFrame()

    for filename in filenames:
        virtual_currency, bitcoin, blockchain, cryptocurrency, digital_currency, litecoin, dogecoin, etherrum = 0, 0, 0, 0, 0, 0, 0, 0

        with open(filename, 'r') as f:
            contents = f.read().lower()

            # Process the content
            cryptocurrency = contents.count("cryptocurrenc")
            virtual_currency = contents.count("virtual currenc")
            digital_currency = contents.count("digital currenc")
            bitcoin = contents.count("bitcoin")
            blockchain = contents.count("blockchain")
            litecoin = contents.count("litecoin")
            dogecoin = contents.count("dogecoin")
            etherrum = contents.count("etherrum")

            # Create data row
            data = {'File Name': filename,
                    'Virtual Currency': virtual_currency,
                    'Bitcoin': bitcoin,
                    'cryptocurrency': cryptocurrency,
                    'digital currency': digital_currency,
                    'litecoin': litecoin,
                    'dogecoin': dogecoin,
                    'etherrum': etherrum}

            # df.append(data, ignore_index=True)
            a = pd.DataFrame(data, index=[0])
            df = df.append(a, ignore_index=True, sort=False)

        # Save data to CSV
        df.to_csv(os.path.join(dir, filename), index=False)

    return df

result = process_files(r'C:\test\QTR2')

The issue is that the generated result is the same for every document. My guess is that the last result overwrites the old values.

Could someone please help check which part went wrong? I really appreciate any help you can provide.

  • [Never call `DataFrame.append` or `pd.concat` inside a for-loop. It leads to quadratic copying.](https://stackoverflow.com/a/36489724/1422451) – Parfait Jul 16 '23 at 03:42
  • `filename` is an absolute pathname, why are you joining it with another directory? – Barmar Jul 16 '23 at 03:43
  • You're writing the dataframe to a different file each time through the loop, but each one contains the concatenation of all the previous dataframes. So only the last file written will have everything in it; the previous ones will be incremental versions. If you want to combine everything, write the CSV once after the loop. – Barmar Jul 16 '23 at 03:45
  • Thanks for the comment, but I have no idea what function I can use to list all the results in one file, either text or CSV. Any suggestion? – Ciercy Jul 16 '23 at 04:18
  • Take the `to_csv` call out of the for loop and append it at the end, calling it with `result`. Then there will be one `.csv` written. – Jan Jul 16 '23 at 04:44
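The suggestions in the comments above can be combined into a minimal sketch: collect one dict of counts per file, build the DataFrame once from the list of dicts (avoiding `append`/`concat` inside the loop), and write a single CSV after the loop. The `TERMS` mapping and the `word_counts.csv` output name are illustrative choices, not from the original post; note also that the original code computed a `blockchain` count but never added it to its data row, so it is included here.

```python
import fnmatch
import os
import pandas as pd

# Column header -> lowercase substring to count. Stems such as
# "cryptocurrenc" match both "cryptocurrency" and "cryptocurrencies".
TERMS = {
    'Virtual Currency': 'virtual currenc',
    'Bitcoin': 'bitcoin',
    'blockchain': 'blockchain',
    'cryptocurrency': 'cryptocurrenc',
    'digital currency': 'digital currenc',
    'litecoin': 'litecoin',
    'dogecoin': 'dogecoin',
    'etherrum': 'etherrum',
}

def process_files(directory, out_csv='word_counts.csv'):
    rows = []  # one dict per text file
    for root, _, files in os.walk(directory):
        for name in fnmatch.filter(files, '*.txt'):
            path = os.path.join(root, name)
            with open(path, 'r') as f:
                contents = f.read().lower()
            row = {'File Name': path}
            row.update({col: contents.count(term) for col, term in TERMS.items()})
            rows.append(row)
    df = pd.DataFrame(rows)  # build the frame once, not per iteration
    # Single output file, written once after the loop, so no .txt is touched.
    df.to_csv(os.path.join(directory, out_csv), index=False)
    return df
```

Building the frame from a list of dicts in one call sidesteps both problems at once: the quadratic copying of per-iteration `append`, and the per-iteration `to_csv` that clobbered files.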

2 Answers


If you are trying to save everything into a single processed file, you have an indentation problem: dedent the final `to_csv` call so it runs once, after the for loop.

# Save data to CSV (dedented out of the for loop)
df.to_csv(os.path.join(dir, filename), index=False)

And if you are trying to write one processed file per loaded file, replacing it each time, you have an initialization problem.

Replace

df = pd.DataFrame()
for filename in filenames:

with below

for filename in filenames:
    df = pd.DataFrame()
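For the second variant (one output per input), here is a minimal sketch of the write step; the `save_per_file` helper and the `_counts.csv` naming are illustrative, not from the answer. Deriving each CSV name from the input file's base name also avoids overwriting the original `.txt` files.

```python
import os
import pandas as pd

def save_per_file(rows_by_path, out_dir):
    """Write one single-row CSV per processed text file.

    rows_by_path maps each input path to its dict of counts; the CSV
    name is derived from the input's base name, so e.g. 'a.txt' becomes
    'a_counts.csv' and the original text file is never overwritten.
    """
    for path, row in rows_by_path.items():
        df = pd.DataFrame([row])  # fresh frame for each file
        base = os.path.splitext(os.path.basename(path))[0]
        df.to_csv(os.path.join(out_dir, base + '_counts.csv'), index=False)
```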

The issue was the `to_csv` call: since `filename` is already an absolute path, `os.path.join(dir, filename)` returns `filename` itself, so the code rewrote the original text files and could not give the correct result. I dropped that line, re-ran on the original dataset, and the code works well.
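The overwrite follows from how `os.path.join` handles absolute components: when a later argument is absolute, the earlier ones are discarded, so joining the directory with an already-absolute filename just yields the filename back. A small demonstration using `ntpath` (Windows path semantics, matching the question's `C:\test\QTR2` paths, runnable on any OS); the `report.txt` path is made up for illustration:

```python
import ntpath  # Windows flavor of os.path, so the example runs anywhere

# A filename as produced by walk() + join() is already absolute.
filename = r'C:\test\QTR2\sub\report.txt'

# Joining the base directory with an absolute path discards the base,
# so to_csv(joined, ...) would write straight over the original .txt file.
joined = ntpath.join(r'C:\test\QTR2', filename)
assert joined == filename
```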

  • As it’s currently written, your answer is unclear. Please [edit] to add additional details that will help others understand how this addresses the question asked. You can find more information on how to write good answers [in the help center](/help/how-to-answer). – Community Jul 20 '23 at 19:12