I have around 65,000 JPG images of cars, and each filename encodes information about the car. For example:

Acura_ILX_2013_28_16_110_15_4_70_55_179_39_FWD_5_4_4dr_aWg.jpg

'Displacement', 'Engine Type', 'Width, Max w/o mirrors (in)', 'Height, Overall (in)',
'Length, Overall (in)', 'Gas Mileage', 'Drivetrain', 'Passenger Capacity', 'Passenger Doors',
'Body Style', 'unique identifier'

Because there are different images of the same car, a unique 3-letter identifier is appended to the end of each filename.

I have created a data frame from the file names using the following code:

import os
import pandas as pd

car_file = os.listdir(r"dir")

make = []
model = []
year = []
msrp = []
front_wheel_size = []
sae_net_hp = []
displacement = []
engine_type = []
width = []
height = []
length = []
mpg = []
drivetrain = []
passenger_capacity = []
doors = []
body_style = []
for i in car_file:
    make.append(i.split("_")[0])
    model.append(i.split("_")[1])
    year.append(i.split("_")[2])
    msrp.append(i.split("_")[3])
    front_wheel_size.append(i.split("_")[4])
    sae_net_hp.append(i.split("_")[5])
    displacement.append(i.split("_")[6])
    engine_type.append(i.split("_")[7])
    width.append(i.split("_")[8])
    height.append(i.split("_")[9])
    length.append(i.split("_")[10])
    mpg.append(i.split("_")[11])
    drivetrain.append(i.split("_")[12])
    passenger_capacity.append(i.split("_")[13])
    doors.append(i.split("_")[14])
    body_style.append(i.split("_")[15])
df = pd.DataFrame([make,model,year,msrp,front_wheel_size,sae_net_hp,displacement,engine_type,width,height,length,mpg,drivetrain,passenger_capacity,doors,body_style]).T   

(I presume this is not the cleanest way to do it.)

My question is: how can I most efficiently include the JPG images in the dataset, perhaps as an additional column at the end?

Nish
  • Why do you want to add `.jpg` objects in a dataframe? What are you trying to achieve with this? If you could explain the bigger picture we could suggest an alternate approach – The Singularity Oct 06 '21 at 11:07
  • @Luke sure, so I'm actually trying to create a dataset of these vehicles which includes some sort of visual data (the images) and the vehicle specification information in one place for now. At a later point, the images will be used for further work, this is just preliminary at the moment. Although it is important that the images of each car are in line with their specification data. – Nish Oct 06 '21 at 11:13
  • _further work_? For a Machine Learning Task? Can you elaborate? – The Singularity Oct 06 '21 at 11:14
  • It is not entirely my project, so the details are a bit vague at the moment. An example is to develop a model in which we can estimate the impacts of the visual features of cars (shapes and design) upon consumer demand for that vehicle, or other economic questions that may be of value. – Nish Oct 06 '21 at 11:22
  • So this `.jpg` dataframe is for a Machine Learning application? – The Singularity Oct 06 '21 at 11:23
  • Yes it would be for machine learning – Nish Oct 06 '21 at 11:24
  • Using a pandas DataFrame containing images for ML is not good practice. – The Singularity Oct 06 '21 at 11:27
  • I'd recommend storing them in directories or sub-directories, or in another format such as `.HDF5`. Consider doing adequate research to avoid bad practices in ML. – The Singularity Oct 06 '21 at 11:29

1 Answer

I am not sure you actually want to open all 65,000 images at once, as this may occupy a huge amount of memory. I'd recommend simply saving the path to each image in the DataFrame.
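
A minimal sketch of the path-only idea, assuming every file follows the 17-part naming scheme from the question. The helper names `build_index` and `load_image` are hypothetical, and the field names are simply the variable names used in the question:

```python
from pathlib import Path

import pandas as pd

# Field names in filename order (taken from the question's variable names).
FIELDS = [
    "make", "model", "year", "msrp", "front_wheel_size", "sae_net_hp",
    "displacement", "engine_type", "width", "height", "length", "mpg",
    "drivetrain", "passenger_capacity", "doors", "body_style", "id",
]


def build_index(folder: str) -> pd.DataFrame:
    """Hypothetical helper: one row per .jpg with its specs and its path."""
    rows = []
    for path in sorted(Path(folder).glob("*.jpg")):
        parts = path.stem.split("_")
        if len(parts) != len(FIELDS):
            continue  # skip files that do not follow the naming scheme
        row = dict(zip(FIELDS, parts))
        row["path"] = str(path)  # store only the location, not the pixels
        rows.append(row)
    return pd.DataFrame(rows, columns=FIELDS + ["path"])


def load_image(path: str):
    """Hypothetical helper: read a single image only when it is needed."""
    # Imported here so building the index does not require matplotlib;
    # reading .jpg files through matplotlib also needs Pillow installed.
    import matplotlib.image as mpimg
    return mpimg.imread(path)
```

With this layout the DataFrame stays small, and something like `load_image(df.loc[0, "path"])` reads one image at the moment it is used.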

If you really want to open them, see: How to read images into a script?

But to clean up your original code: I did something similar a while back and solved it via regex. That might be overdoing it, though. You can also use split directly to put your values into rows instead of building columns. Both ideas are shown in the example below (it might contain errors).

from pathlib import Path
import re
import pandas as pd
import matplotlib.image as mpimg
from typing import Iterable, List


FILEPARTS = [
    "make", "model", "year", "msrp", "front_wheel_size", 
    "sae_net_hp", "displacement", "engine_type",              
    "width", "height", "length", "mpg",
    "drivetrain", "passenger_capacity", 
    "doors", "body_style", "id"
]


def via_regex(path_to_folder: str) -> pd.DataFrame:
    """ Matches filenames via regex. 
    This way you would skip all files in the folder that are not
    .jpg and also don't match your pattern."""
    folder = Path(path_to_folder)
    
    # select only .jpg files
    files = folder.glob('*.jpg')
    
    matches = filename_matcher(files)
    
    # build DataFrame
    df = pd.DataFrame(m.groupdict() for m in matches)
    df["File"] = [folder / m.string for m in matches]
    df["Image"] = [mpimg.imread(f) for f in df["File"].to_numpy()]
    return df


def filename_matcher(files: Iterable) -> List:
    """Match the desired pattern to the filename, i.e. extracts the data from 
    the filename into a match object. More flexible and via regex you
    could also separate numbers from units or similar."""
    # create regex pattern that groups the parts between underscores;
    # the trailing \.jpg keeps the file extension out of the "id" group
    pattern = "_".join(f"(?P<{name}>[^_]+)" for name in FILEPARTS) + r"\.jpg$"
    pattern = re.compile(pattern)
    
    # match the pattern
    matches = (pattern.match(f.name) for f in files)
    return [match for match in matches if match is not None]


def via_split(path_to_folder: str) -> pd.DataFrame:
    """ Assumes all .jpg files have the right naming."""
    folder = Path(path_to_folder)
    
    # select only .jpg files; glob() returns a generator, so materialize it
    files = sorted(folder.glob('*.jpg'))
    
    # build one row per file, then create the DataFrame in a single step
    # (assigning an image array to a single cell via .loc does not work)
    rows = []
    for f in files:
        row = dict(zip(FILEPARTS, f.stem.split('_')))
        row["File"] = f
        row["Image"] = mpimg.imread(f)
        rows.append(row)
    return pd.DataFrame(rows, columns=FILEPARTS + ["File", "Image"])


if __name__ == '__main__':
    df_re = via_regex('dir')
    df_split = via_split('dir')
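
To check the parsing without reading any images, the named-group pattern can be tried on the example filename from the question. This is only a sketch: `parse_filename` is a hypothetical helper, and the pattern assumes exactly 17 underscore-separated parts followed by `.jpg`.

```python
import re

FILEPARTS = [
    "make", "model", "year", "msrp", "front_wheel_size",
    "sae_net_hp", "displacement", "engine_type",
    "width", "height", "length", "mpg",
    "drivetrain", "passenger_capacity",
    "doors", "body_style", "id",
]

# One named group per field, joined by underscores; the extension is
# matched separately so that it does not end up in the "id" group.
PATTERN = re.compile(
    "_".join(f"(?P<{name}>[^_]+)" for name in FILEPARTS) + r"\.jpg$"
)


def parse_filename(name: str):
    """Hypothetical helper: return the fields as a dict, or None if no match."""
    match = PATTERN.match(name)
    return match.groupdict() if match else None


sample = "Acura_ILX_2013_28_16_110_15_4_70_55_179_39_FWD_5_4_4dr_aWg.jpg"
fields = parse_filename(sample)
print(fields["make"], fields["year"], fields["drivetrain"], fields["id"])
```

Filenames with too few parts simply return `None`, which is how `via_regex` ends up skipping them.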
Joschua
  • That actually makes a lot of sense - memory didn't even cross my mind. Thank you for amending my code, it looks a bit alien at the moment but I'll take a look. – Nish Oct 06 '21 at 13:55
  • The `via_regex` part is maybe a bit overkill. My original problem included filenames like `59mm_10N_2500TeV` so it made sense to use regex to split numbers and units from each other. I only included it, as it is more versatile than the second solution, and maybe helpful for someone finding this thread having more complicated filenames. To replicate what you were doing, but in a more streamlined way, just look at the `via_split` function. Don't hesitate to ask me if you have questions about the code. – Joschua Oct 06 '21 at 15:47
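
For filenames like the `59mm_10N_2500TeV` case mentioned in the comment above, a rough sketch of splitting numbers from units with regex could look like this. The helper name `split_values_and_units` is hypothetical, and it assumes every part is a plain number directly followed by a unit made of letters:

```python
import re

# A number (optionally with a decimal part) directly followed by letters.
PART = re.compile(r"(?P<value>\d+(?:\.\d+)?)(?P<unit>[A-Za-z]+)$")


def split_values_and_units(stem: str):
    """Hypothetical helper: turn '59mm_10N' into [(59.0, 'mm'), (10.0, 'N')]."""
    result = []
    for part in stem.split("_"):
        match = PART.match(part)
        if match is None:
            raise ValueError(f"unexpected filename part: {part!r}")
        result.append((float(match.group("value")), match.group("unit")))
    return result


print(split_values_and_units("59mm_10N_2500TeV"))
```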