I have a CSV file (VV_AL_3T3_P3.csv), and each of its rows corresponds to a TIFF image of plankton. It looks like this:

Particle_ID  Diameter  Image_File                   Lenght ....etc
          1     15.36  VV_AL_3T3_P3_R3_000001.tif    18.09
          2     17.39  VV_AL_3T3_P3_R3_000001.tif    19.86
          3     17.21  VV_AL_3T3_P3_R3_000001.tif    21.77
          4      9.42  VV_AL_3T3_P3_R3_000001.tif     9.83

The images were originally all together in one folder and were then sorted by shape into separate folders. The name of each TIFF image is formed from Image_File + Particle_ID; for example, for the first row: VV_AL_3T3_P3_R3_000001_1.tiff
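The naming rule can be sketched as a small helper. `particle_image_name` is a hypothetical name, and the sketch assumes the collage name always ends in `.tif` while the particle image ends in `.tiff`, as in the example row.

```python
import os

def particle_image_name(image_file, particle_id):
    """Build the per-particle file name from the Image_File and Particle_ID values."""
    stem, _ext = os.path.splitext(image_file)  # drop the collage's ".tif"
    return f"{stem}_{particle_id}.tiff"

print(particle_image_name("VV_AL_3T3_P3_R3_000001.tif", 1))
# VV_AL_3T3_P3_R3_000001_1.tiff
```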

Now I want to use Python to add a new column called 'Class' to the CSV file I already have (VV_AL_3T3_P3.csv), holding the name of the folder where each .tiff file is located (the class), like this:

Particle_ID  Diameter  Image_File                   Lenght   Class
          1     15.36  VV_AL_3T3_P3_R3_000001.tif    18.09   Spherical
          2     17.39  VV_AL_3T3_P3_R3_000001.tif    19.86   Elongated
          3     17.21  VV_AL_3T3_P3_R3_000001.tif    21.77   Pennates
          4      9.42  VV_AL_3T3_P3_R3_000001.tif     9.83   Others

So far, I have a list with the names of the folders where every TIFF file is located. This list will become the new column. However, how can I match every folder to its row? In other words, how do I match the 'Class' with the 'Particle ID' and 'Image file'?

For now:

## Load modules:
import os
import pandas as pd
import numpy as np
import cv2

## Function to recursively list files in dir by extension
def file_match(path,extension):
    cfiles = []
    for root, dirs, files in os.walk(path):
        for file in files:
            if file.endswith(extension):
                cfiles.append(os.path.join(root, file))
    return cfiles


## Load all image file at all folders:
image_files = file_match(path='./',extension='.tiff')

## List of directories where each image was found:
img_dir = [os.path.dirname(one_img)[2:] for one_img in image_files]
len(img_dir)

## List of images:
# Image file column in csv files:
img_file = [os.path.basename(one_img)[:22] for one_img in image_files]
len(img_file)
# Particle id column in csv files:
part_id  = [os.path.basename(one_img)[23:][:-5] for one_img in image_files]
len(part_id)

## I now have the collage picture name, the particle ID and the classification folder.
# Now I need to create a loop that merges this information...

## Load csv file:
data = pd.read_csv('VV_AL_3T3_P3.csv')
sample_file = data['Image File']  # Column name
sample_id   = data['Particle ID'] # Particle ID

I have seen a similar case here: Create new column in dataframe with match values from other dataframe

but I don't really know how to use 'map' with 'set_index'; also, that example has two data frames, whereas I have just one.

Olga

3 Answers

For the first part of your question, use os.path.split

If your path was... /home/usuario/Desktop/Classification/Fraction_9to20um/Classes/test

os.path.split(path)[1]

would return test.

Then, in your for loop, append that to each row:

for row in rows:
    # list.append() works in place and returns None, so do not reassign row
    row.append(os.path.split(path)[1])
    writer.writerow(row)

ref: https://docs.python.org/3/library/os.path.html

snewman0008

You can use os.path.split(path) to break a path into two parts: the beginning and the last piece, whether it's a file or a directory.

For example:

myPath = '/test/second/third/theFile.txt'
firstPair = os.path.split(myPath)
# firstPair == ('/test/second/third', 'theFile.txt')

If you have the full filepath and want the last directory name, run this command twice:

filePath = '/home/usuario/Desktop/Classification/Fraction_9to20um/Classes/ClassA/img_001.tiff'
firstPair = os.path.split(filePath)
secondPair = os.path.split(firstPair[0])
print(secondPair[1])
# ClassA
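
To do this for many files at once, the two `os.path.split` calls can be wrapped in a helper and applied in a list comprehension. This is a minimal sketch: `parent_folder` and the sample paths are hypothetical, and `os.path.basename(os.path.dirname(...))` is used as a shorthand for splitting twice.

```python
import os

def parent_folder(path):
    """Return the name of the directory that directly contains `path`."""
    return os.path.basename(os.path.dirname(path))

# Hypothetical paths; in practice this would be the full list of .tiff files.
file_paths = [
    "/data/Classes/ClassA/img_001.tiff",
    "/data/Classes/ClassB/img_002.tiff",
]
classes = [parent_folder(p) for p in file_paths]
print(classes)
# ['ClassA', 'ClassB']
```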
WhiteHotLoveTiger
  • I followed your instructions; however, I have 827 .tiff images. Is there a way to do them all at once? – Olga Apr 23 '18 at 09:11

It sounds like my_files is a list of (paths+tiff_file_name). What you want is the last segment of the parent directory's absolute path, it seems.

So, /some/path/to/directory/classA/instance.tiff would be assigned to classA.

There are two approaches, with two slightly different interpretations:

1) The second-to-last part of the path is the class.

rows = [file.split(os.path.sep)[-2] for file in my_files]

2) The containing directory of the file, relative to the Classes directory, is the class.

rows = [ os.path.relpath( os.path.dirname(file), '/home/usuario/Desktop/Classification/Fraction_9to20um/Classes/' ) for file in my_files ]
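
The two interpretations only differ when an image sits in a nested subfolder. A small sketch of that difference follows; the root directory and file names are made up for illustration.

```python
import os

# Hypothetical layout: one image directly in a class folder, one nested deeper.
classes_root = os.path.join(os.sep, "data", "Classes")
my_files = [
    os.path.join(classes_root, "Spherical", "a_1.tiff"),
    os.path.join(classes_root, "Others", "Broken", "b_2.tiff"),
]

# Approach 1: the second-to-last path component.
by_second_last = [f.split(os.path.sep)[-2] for f in my_files]

# Approach 2: the containing directory, relative to the Classes root.
by_relpath = [os.path.relpath(os.path.dirname(f), classes_root) for f in my_files]

print(by_second_last)  # ['Spherical', 'Broken']
print(by_relpath)      # on POSIX: ['Spherical', 'Others/Broken']
```

If every image sits exactly one level below the Classes directory, both approaches give the same list.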


EDIT (for clarification/sample): To write out the classes together with their files:

import csv
import os

# newline="" lets the csv module control line endings itself
with open(output_path, "w", newline="") as f:
    writer = csv.writer(f)
    # optionally, write the header
    writer.writerow(['full_img_path', 'img_class'])
    for file in my_files:
        # class = containing directory, relative to the Classes directory
        img_class = os.path.relpath(
            os.path.dirname(file),
            '/home/usuario/Desktop/Classification/Fraction_9to20um/Classes/'
        )
        writer.writerow([file, img_class])

It's not clear from your question if you want your output_path to be class.csv or VV_AL_3T3_P3.csv, but hopefully you see that it's easily interchangeable.

Note that the above pattern tends to be easy enough to implement/debug if there is a one-to-one correspondence between inputs and outputs (input -> simple transform -> output). But once you begin aggregating data (say, the average number of files per class), you might want to begin exploring a data manipulation library like pandas.
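
To line the classes up with the rows of an existing CSV, one possible sketch is to build a lookup keyed on (Image_File, Particle_ID) and fill the new column row by row. This assumes the columns are named `Image_File` and `Particle_ID` and that image names follow the `<collage>_<id>.tiff` pattern from the question; `class_lookup`, `add_class_column` and the in-memory sample data are hypothetical stand-ins for the real files.

```python
import csv
import io
import os

def class_lookup(paths):
    """Map (Image_File, Particle_ID) to the name of the folder holding the image."""
    lookup = {}
    for p in paths:
        stem = os.path.splitext(os.path.basename(p))[0]   # e.g. VV_AL_3T3_P3_R3_000001_1
        collage, particle_id = stem.rsplit("_", 1)        # split off the trailing ID
        lookup[(collage + ".tif", particle_id)] = os.path.basename(os.path.dirname(p))
    return lookup

def add_class_column(src, dst, lookup):
    """Copy CSV rows from src to dst, appending a Class column."""
    reader = csv.DictReader(src)
    writer = csv.DictWriter(dst, fieldnames=reader.fieldnames + ["Class"])
    writer.writeheader()
    for row in reader:
        key = (row["Image_File"], row["Particle_ID"])
        row["Class"] = lookup.get(key, "")   # left blank if no image was found
        writer.writerow(row)

# In-memory stand-ins for the image list and the CSV file.
my_files = [
    "Classes/Elongated/VV_AL_3T3_P3_R3_000001_2.tiff",
    "Classes/Spherical/VV_AL_3T3_P3_R3_000001_1.tiff",
]
src = io.StringIO(
    "Particle_ID,Diameter,Image_File\n"
    "1,15.36,VV_AL_3T3_P3_R3_000001.tif\n"
    "2,17.39,VV_AL_3T3_P3_R3_000001.tif\n"
)
dst = io.StringIO()
add_class_column(src, dst, class_lookup(my_files))
print(dst.getvalue())
```

With real files, `src` and `dst` would be file objects opened with `newline=""`. The same (Image_File, Particle_ID) key pair could also serve as the `on=` columns of a pandas `DataFrame.merge`.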

jagthebeetle
  • You were right, my_files is a list of (paths+.tiff). I followed the second approach, and now I have a list called rows with the different classes, i.e. the names of the folders containing the .tiff files. However, how do I convert that into a new column of my CSV file, VV_AL_3T3_P3.csv? I want every .tiff file to be paired with its folder. – Olga Apr 20 '18 at 14:29
  • See the edit. For writing, the csv.writer will usually have you write arrays of values out. So simply calculate all your values per row, and write them out as an array. – jagthebeetle Apr 23 '18 at 11:09
  • I edited the question; I hope it is clearer now. I obtained two columns, one with the path and another with the corresponding class, but they didn't line up with the rest of the columns, like Image_File, Particle ID, Diameter, etc. I am sorry if I am bothering you; I really appreciate your help, and it is helping me learn more about Python. – Olga Apr 23 '18 at 15:10