
I need to write code that reads all the text files in the subfolders and counts the frequency of some words in each text file. Below is my code:

import fnmatch
import os
from os import walk
from os.path import join

import pandas as pd

def process_files(dir):
    # List of filenames in the directory
    filenames = []
    for root, _, files in walk(dir):
        filenames.extend([join(root, file) for file in fnmatch.filter(files, '*.txt')])

    df = pd.DataFrame()

    for filename in filenames:
        virtual_currency, bitcoin, blockchain, cryptocurrency, digital_currency, litecoin, dogecoin, etherrum = 0, 0, 0, 0, 0, 0, 0, 0

        with open(filename, 'r') as f:
            contents = f.read().lower()

            # Process the content
            cryptocurrency = contents.count("cryptocurrenc")
            virtual_currency = contents.count("virtual currenc")
            digital_currency = contents.count("digital currenc")
            bitcoin = contents.count("bitcoin")
            blockchain = contents.count("blockchain")
            litecoin = contents.count("litecoin")
            dogecoin = contents.count("dogecoin")
            etherrum = contents.count("etherrum")

            # Create data row
            data = {'File Name': filename,
                    'Virtual Currency': virtual_currency,
                    'Bitcoin': bitcoin,
                    'cryptocurrency': cryptocurrency,
                    'digital currency': digital_currency,
                    'litecoin': litecoin,
                    'dogecoin': dogecoin,
                    'etherrum': etherrum}

            # df.append(data, ignore_index=True)
            a = pd.DataFrame(data, index=[0])
            df = df.append(a, ignore_index=True, sort=False)

        # Save data to CSV
        df.to_csv(os.path.join(dir, filename), index=False)

    return df

result = process_files(r'C:\test\QTR2')

The issue is that the generated result is the same for every document. My guess is that the last result overwrites the old values.

Could someone please help check which part went wrong? I really appreciate any help you can provide.

  • [Never call `DataFrame.append` or `pd.concat` inside a for-loop. It leads to quadratic copying.](https://stackoverflow.com/a/36489724/1422451) – Parfait Jul 16 '23 at 03:42
  • `filename` is an absolute pathname, why are you joining it with another directory? – Barmar Jul 16 '23 at 03:43
  • You're writing the dataframe to a different file each time through the loop, but each one contains the concatenation of all the previous dataframes. So only the last file written will have everything in it; the previous ones will be incremental versions. If you want to combine everything, write the CSV once after the loop. – Barmar Jul 16 '23 at 03:45
  • Thanks for the comment, but I have no idea what function I can use to list all the results in one file, either text or CSV. Any suggestion? – Ciercy Jul 16 '23 at 04:18
  • Take the `to_csv` call out of the for loop and append it at the end, calling it with `result`. Then there will be one `.csv` written. – Jan Jul 16 '23 at 04:44
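The suggestions in the comments above can be combined into a minimal sketch: collect one dict of counts per file, build the DataFrame once from the list of dicts (avoiding `append`/`concat` inside the loop), and write a single CSV after the loop. The `TERMS` mapping and the `word_counts.csv` output name are illustrative choices, not from the original post; note also that the original code computed a `blockchain` count but never added it to its data row, so it is included here.

```python
import fnmatch
import os
import pandas as pd

# Column header -> lowercase substring to count. Stems such as
# "cryptocurrenc" match both "cryptocurrency" and "cryptocurrencies".
TERMS = {
    'Virtual Currency': 'virtual currenc',
    'Bitcoin': 'bitcoin',
    'blockchain': 'blockchain',
    'cryptocurrency': 'cryptocurrenc',
    'digital currency': 'digital currenc',
    'litecoin': 'litecoin',
    'dogecoin': 'dogecoin',
    'etherrum': 'etherrum',
}

def process_files(directory, out_csv='word_counts.csv'):
    rows = []  # one dict per text file
    for root, _, files in os.walk(directory):
        for name in fnmatch.filter(files, '*.txt'):
            path = os.path.join(root, name)
            with open(path, 'r') as f:
                contents = f.read().lower()
            row = {'File Name': path}
            row.update({col: contents.count(term) for col, term in TERMS.items()})
            rows.append(row)
    df = pd.DataFrame(rows)  # build the frame once, not per iteration
    # Single output file, written once after the loop, so no .txt is touched.
    df.to_csv(os.path.join(directory, out_csv), index=False)
    return df
```

Building the frame from a list of dicts in one call sidesteps both problems at once: the quadratic copying of per-iteration `append`, and the per-iteration `to_csv` that clobbered files.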

2 Answers


If you are trying to save everything into a single processed file, you have an indentation problem: dedent the final `to_csv` call so it runs once, after the for loop.

# Save data to CSV (dedented out of the for loop)
df.to_csv(os.path.join(dir, filename), index=False)

And if you are trying to write one processed file per loaded file, replacing it each time, you have an initialization problem.

Replace

df = pd.DataFrame()
for filename in filenames:

with below

for filename in filenames:
    df = pd.DataFrame()
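For the second variant (one output per input), here is a minimal sketch of the write step; the `save_per_file` helper and the `_counts.csv` naming are illustrative, not from the answer. Deriving each CSV name from the input file's base name also avoids overwriting the original `.txt` files.

```python
import os
import pandas as pd

def save_per_file(rows_by_path, out_dir):
    """Write one single-row CSV per processed text file.

    rows_by_path maps each input path to its dict of counts; the CSV
    name is derived from the input's base name, so e.g. 'a.txt' becomes
    'a_counts.csv' and the original text file is never overwritten.
    """
    for path, row in rows_by_path.items():
        df = pd.DataFrame([row])  # fresh frame for each file
        base = os.path.splitext(os.path.basename(path))[0]
        df.to_csv(os.path.join(out_dir, base + '_counts.csv'), index=False)
```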

The issue was the `to_csv` call: since `filename` is already an absolute path, `os.path.join(dir, filename)` returns `filename` itself, so the code rewrote the original text files and could not give the correct result. I dropped that line, re-ran on the original dataset, and the code works well.
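The overwrite follows from how `os.path.join` handles absolute components: when a later argument is absolute, the earlier ones are discarded, so joining the directory with an already-absolute filename just yields the filename back. A small demonstration using `ntpath` (Windows path semantics, matching the question's `C:\test\QTR2` paths, runnable on any OS); the `report.txt` path is made up for illustration:

```python
import ntpath  # Windows flavor of os.path, so the example runs anywhere

# A filename as produced by walk() + join() is already absolute.
filename = r'C:\test\QTR2\sub\report.txt'

# Joining the base directory with an absolute path discards the base,
# so to_csv(joined, ...) would write straight over the original .txt file.
joined = ntpath.join(r'C:\test\QTR2', filename)
assert joined == filename
```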

  • As it’s currently written, your answer is unclear. Please [edit] to add additional details that will help others understand how this addresses the question asked. You can find more information on how to write good answers [in the help center](/help/how-to-answer). – Community Jul 20 '23 at 19:12