
I'm quite new to Python and encountered a problem: I want to write a script that starts in a base directory containing several folders, which all have the same subdirectory structure and are numbered with a control variable (scan00, scan01, ...). I read the names of the folders in the directory and store them in a variable called foldernames.

Then, the script should go into a subdirectory of each of these folders where multiple txt files are stored; I store them in the variable called "myFiles". These txt files consist of 3 columns of float values separated by tabs, and each txt file has 3371 rows (they are all identical in terms of rows and columns). Now my issue: I want the script to copy only the third column of every txt file into a new txt or csv file. The only exception is the first txt file, where it is important that all three columns are copied to the new file. For the other files, the third column should be copied into an adjacent column of the new txt/csv file. So I would like to end up with x columns in the generated txt/csv file, where x is the number of original txt files. If possible, I would also like to write the corresponding file names into the first line of the new txt/csv file (here defined as column_names). In the end, each folder should contain one txt/csv file that combines all of its single (297) txt files.

import os
import glob

foldernames1 = []
for foldernames in os.listdir("W:/certaindirectory/"):
    if foldernames.startswith("scan"):
        # print(foldernames)
        foldernames1.append(foldernames)
        

for i in range(1, len(foldernames1)):
    workingpath = "W:/certaindirectory/"+foldernames1[i]+"/.../"
    os.chdir(workingpath)
    myFiles = glob.glob('*.txt')
    column_names = ['X','Y']+myFiles[1:len(myFiles)]
    
    
    files = [open(f) for f in glob.glob('*.txt')]  
    fout = open ("ResultsCombined.txt", 'w')
    
    for row in range(1, 3371): #len(files)):

        for f in files:
            fout.write(f.readline().strip().split('\t')[2])
            fout.write('\t')
        fout.write('\t')
     
    
    fout.close()

As an alternative I also tried to do it via a csv file, but I wasn't able to solve my problem:

import os
import glob
import csv

foldernames1 = []
for foldernames in os.listdir("W:/certain directory/"):
    if foldernames.startswith("scan"):
        # print(foldernames)
        foldernames1.append(foldernames)
        

for i in range(1, len(foldernames1)):
    workingpath = "W:/certain directory/"+foldernames1[i]+"/.../"
    os.chdir(workingpath)
    myFiles = glob.glob('*.txt')
    column_names = ['X','Y']+myFiles[0:len(myFiles)]
    # print(column_names)
    
    with open(""+foldernames1[i]+".csv", 'w', newline='') as target:
        writer = csv.DictWriter(target, fieldnames=column_names)
        writer.writeheader() # if you want a header
        
        for path in glob.glob('*.txt'):
            with open(path, newline='') as source:
                reader = csv.DictReader(source, delimiter='\t', fieldnames=column_names)
                writer.writerows(reader)

Can anyone help me? Both codes do not deliver what I want. They are reading out something, but not the values I am interested in. I also have the feeling my code has some issues with float numbers?

Many thanks and best regards, quester

  • https://stackoverflow.com/questions/13613336/how-do-i-concatenate-text-files-in-python might help - all it requires you to do is to get the names of the files and you are all set – Larry the Llama Nov 08 '21 at 09:11

1 Answer


pathlib and pandas should make the solution here relatively simple even without knowing the specific file names:

import pandas as pd
from pathlib import Path

p = Path("W:/certain directory/")
# recursively search for .txt files inside all sub directories
txt_files = list(p.rglob("*.txt"))  # use p.glob("*.txt") instead for a non-recursive search
df = pd.DataFrame()
for path in txt_files:
    # use tab separator, read only 3rd column, name the column, read as floats
    current = pd.read_csv(path, 
                          sep="\t", 
                          usecols=[2], 
                          names=[path.name], 
                          dtype="float64")
    # add header=0 to pd.read_csv if there's a header row in the .txt files
    # assign the result back: pd.concat returns a new DataFrame and does not modify df in place
    df = pd.concat([df, current], axis=1)
df.to_csv("W:/certain directory/floats_third_column.csv", index=False)
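The snippet above merges every .txt file under the base directory into a single frame. The question also asks for one combined file per scan folder, with the first file contributing all three columns and the remaining files only their third column. A minimal sketch of that variant, assuming the same layout as in the question (the function name and the `_combined.csv` output name are my own; the files are read with `header=None`, i.e. no header row):

```python
import pandas as pd
from pathlib import Path

def combine_scan_folders(base):
    """For each scan* folder under base, merge its .txt files into one CSV:
    all three columns from the first file, then only the third column of
    each remaining file, with the file names as column headers."""
    base = Path(base)
    for folder in sorted(base.glob("scan*")):
        txt_files = sorted(folder.rglob("*.txt"))
        if not txt_files:
            continue
        parts = []
        for i, path in enumerate(txt_files):
            raw = pd.read_csv(path, sep="\t", header=None, dtype="float64")
            if i == 0:
                # first file: keep all three columns
                raw.columns = ["X", "Y", path.name]
                parts.append(raw)
            else:
                # later files: keep only the third column, named after the file
                col = raw.iloc[:, [2]]
                col.columns = [path.name]
                parts.append(col)
        # side-by-side concatenation works because all files have the same row count
        combined = pd.concat(parts, axis=1)
        combined.to_csv(folder / f"{folder.name}_combined.csv", index=False)
```

Called as `combine_scan_folders("W:/certain directory/")`, this would leave one `scanNN_combined.csv` inside each scan folder. Add `skiprows=1` to `pd.read_csv` if the first data row should be skipped, as discussed in the comments below.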

Hope this helps!

  • Many thanks, this already helps me a lot. I will try to implement it. I just forgot to mention one key aspect: is it possible to skip the first row, I mean copy the third column starting at row 2 in the txt files? – quester Nov 08 '21 at 10:12
  • You can add an argument to the read_csv function, e.g., skiprows=1. This will cause pandas to ignore the first row in the file. Are you trying to ignore the first row of data or the header row? – Yoav Tulpan Nov 08 '21 at 10:26
  • Ok, many thanks. Sorry, I wasn't aware how the header is defined. I will simply implement your suggested header=0 line! The problem I face now with the code is that the csv files are created but they are empty. – quester Nov 08 '21 at 10:54
  • I would try loading a single txt file with the read_csv command to see what happens. It could be that the files aren't exactly formatted as you think they are. E.g., df = pd.read_csv("example.txt", sep="\t") print(df.head()) – Yoav Tulpan Nov 08 '21 at 11:39
  • Many thanks. I implemented your hint and now I have one txt file with all three columns in my csv. Due to your help I'm getting closer; really appreciate that! Now I have to fix that the third columns of the other txt files are copied to the next columns in the csv file – quester Nov 08 '21 at 12:28