
I have thousands of files in more than a hundred folders. I need to write the name of one of the parent folders into each file as a column.

Directory Structure:

Data -> 000 -> Trajectory -> set of files
Data -> 001 -> Trajectory -> set of files
Data -> 002 -> Trajectory -> set of files
Data -> 003 -> Trajectory -> set of files
.        .        .
.        .        .
.        .        .
Data -> nnn -> Trajectory -> set of files

Every Trajectory folder has more than a hundred files. Every file has the extension .plt and the following columns:

39.984702,116.318417,0,492,39744.1201851852,2008-10-23,02:53:04
39.984683,116.31845,0,492,39744.1202546296,2008-10-23,02:53:10
39.984686,116.318417,0,492,39744.1203125,2008-10-23,02:53:15
39.984688,116.318385,0,492,39744.1203703704,2008-10-23,02:53:20
39.984655,116.318263,0,492,39744.1204282407,2008-10-23,02:53:25
39.984611,116.318026,0,493,39744.1204861111,2008-10-23,02:53:30

What I am trying to do is put the folder name into each file as one of the columns.

Expected output for the files in the folder named 000:

000 39.984702,116.318417,0,492,39744.1201851852,2008-10-23,02:53:04
000 39.984683,116.31845,0,492,39744.1202546296,2008-10-23,02:53:10
000 39.984686,116.318417,0,492,39744.1203125,2008-10-23,02:53:15
000 39.984688,116.318385,0,492,39744.1203703704,2008-10-23,02:53:20
000 39.984655,116.318263,0,492,39744.1204282407,2008-10-23,02:53:25
000 39.984611,116.318026,0,493,39744.1204861111,2008-10-23,02:53:30

I could not find any similar example to work from. Any suggestion will be helpful.

Edit 1: @EdChum suggested using glob, but that only lets me find files with a given extension. My problem here is something else.

In simpler words:

rootdir -> subdir_1 -> subdir_2 -> files

Include the name of subdir_1 as col[0] in all the files present in subdir_2, along with the other columns. The existing files can be modified; there is no need to create a new output file.

Sitz Blogz
  • Please provide your attempt at the problem – Alan Kavanagh May 17 '16 at 11:17
  • To be frank, I have been searching for how to even start, but I have not seen a single example that reads the directory name and puts it into the file as one of the columns :( Hence the question has no code attempt. – Sitz Blogz May 17 '16 at 11:19
  • You want to use [glob](http://stackoverflow.com/questions/2186525/use-a-glob-to-find-files-recursively-in-python) and then parse the path to get the folder name; after loading each file, add a new column `df['folder_name'] = folder_name`. Please have a try. – EdChum May 17 '16 at 11:21
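
The glob suggestion above could be sketched roughly as follows, using only the standard library rather than a DataFrame. The helper name `files_with_folder` is hypothetical, and the `Data/*/Trajectory/*.plt` pattern is an assumption based on the directory layout shown in the question.

```python
import glob
import os


def files_with_folder(root="Data"):
    """Return sorted (folder_name, file_path) pairs for root/<folder>/Trajectory/*.plt."""
    # Assumed layout: root/<folder>/Trajectory/<file>.plt
    pattern = os.path.join(root, "*", "Trajectory", "*.plt")
    pairs = []
    for path in glob.glob(pattern):
        # Go up two levels from the file to reach <folder>
        folder = os.path.basename(os.path.dirname(os.path.dirname(path)))
        pairs.append((folder, path))
    return sorted(pairs)
```

Each pair carries the folder name next to the file it belongs to, so the name is already available when the file is opened for rewriting.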

1 Answer

  • The first block of code collects all the files that end with .plt.
  • Next, we check that your subdir_1 consists only of digits and is 3 characters long (just a sanity check to make sure that we don't process every file that ends with .plt) and that the .plt file is in a Trajectory folder.
  • Finally, a new file is opened which has the same name as the original file, but with .new appended. Each line from the old file is read, a new column with the directory name is added at the beginning, and the new line is written to the output file.


import os

# Collect every .plt file below Data
# (run the script from the directory that contains Data)
traj_files = []
for root, dirs, files in os.walk('Data'):
    for filename in files:
        if filename.endswith('.plt'):
            traj_files.append(os.path.join(root, filename))

for traj_file in traj_files:

    # The new column is the first folder below Data, e.g. '000'
    # (splitting on '/' assumes POSIX-style path separators)
    new_col = traj_file.split('/')[1]
    # Sanity check: three digits, and the file sits in a Trajectory folder
    # (str.isnumeric() requires Python 3)
    if len(new_col) != 3 or not new_col.isnumeric() or '/Trajectory/' not in traj_file:
        continue

    # Copy the old file line by line into <name>.plt.new,
    # with the new column prepended to every line
    with open(traj_file + '.new', 'w') as new_traj:
        with open(traj_file, 'r') as old_traj:
            for line in old_traj:
                new_traj.write(new_col + ' ' + line)

There are certainly more flexible and elegant approaches but this should work for your particular directory structure.
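
As one possible more flexible variant, here is a minimal sketch of the folder-name check that does not depend on the `/` separator or on Python 3. The helper name `folder_column` is hypothetical, and it assumes the exact layout `Data/<3 digits>/Trajectory/<file>`, which is slightly stricter than the substring check used in the script.

```python
import os


def folder_column(traj_file, root="Data"):
    """Return the folder directly below root, or None if the path does not match the assumed layout."""
    # Split the path relative to root using the platform's own separator
    parts = os.path.relpath(traj_file, root).split(os.sep)
    # Assumed layout: <folder>/Trajectory/<file>
    if len(parts) != 3 or parts[1] != "Trajectory":
        return None
    folder = parts[0]
    # str.isdigit() is available on str in both Python 2 and Python 3
    if len(folder) != 3 or not folder.isdigit():
        return None
    return folder
```

Files for which this returns `None` would be skipped, just as the `continue` does in the loop.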

Maximilian Peters
  • Thank you! Let me check this immediately and get back to you. – Sitz Blogz May 17 '16 at 14:17
  • You'll need Python 3, and the script has to be started in the directory where Data is located; otherwise it won't work. – Maximilian Peters May 17 '16 at 14:19
  • The directory part goes without saying, but I am using Python 2.7 and I am getting this error: `Traceback (most recent call last): File "dict_name_file.py", line 15, in if len(new_col) != 3 or not new_col.isnumeric() or not '/Trajectory/' in traj_file: AttributeError: 'str' object has no attribute 'isnumeric'` – Sitz Blogz May 17 '16 at 14:22
  • Change `new_col = traj_file.split('/')[1]` to `new_col = unicode(traj_file.split('/')[1])` or remove the filename check completely. – Maximilian Peters May 17 '16 at 14:24
  • Great! This works like a charm. A few changes, though: can we change the output destination? Also, the new column is separated by a space and not by a comma as in CSV format. Could that also be changed while writing? We can replace the .new extension with `csv`. – Sitz Blogz May 17 '16 at 14:30
  • Of course, you can change the output format as you wish: just replace the space in the `write` call with a tab, and the filename for `new_traj` with the location you need. – Maximilian Peters May 17 '16 at 14:32
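
The change discussed in the last two comments might look roughly like this. It is only a sketch: the helper name `write_with_folder`, the `out_dir` parameter, and the `<folder>_<name>.csv` naming scheme are assumptions, not part of the original script. The folder name is put into the output filename because files in different folders may share a name once they are written to a single directory.

```python
import os


def write_with_folder(traj_file, folder, out_dir):
    """Write traj_file to out_dir as <folder>_<name>.csv with folder as the first comma-separated column."""
    if not os.path.isdir(out_dir):
        os.makedirs(out_dir)
    # Swap the .plt extension for .csv and prefix the folder name (assumed naming scheme)
    base = os.path.splitext(os.path.basename(traj_file))[0]
    out_path = os.path.join(out_dir, "{}_{}.csv".format(folder, base))
    with open(traj_file, "r") as src, open(out_path, "w") as dst:
        for line in src:
            # A comma instead of a space keeps the whole line valid CSV
            dst.write(folder + "," + line)
    return out_path
```

Inside the loop this would replace the two nested `with` blocks, for example `write_with_folder(traj_file, new_col, 'output')`, where `'output'` is a placeholder directory name.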