I want to read the folder names from a tar.gz file and create a column that contains those names.

I'm using this code:

import tarfile
import pandas as pd

file_path = r"C:\Users\filename.tar.gz"
start_with = './mainfolder/'

with tarfile.open(file_path, "r:*") as tar:
    # keep only the CSV members under the main folder
    csv_path = [n for n in tar.getnames() if n.endswith('.csv') and n.startswith(start_with)]

    csv_list = []

    for file in csv_path:
        df_temp = pd.read_csv(tar.extractfile(file))
        csv_list.append(df_temp)

    df = pd.concat(csv_list)

The main folder contains several named subfolders. After reading a CSV file from folder "X" (for example), a "FolderName" column should be added to that file's DataFrame, holding the folder name ("X") in every row. The same should happen for every CSV file.

An example path string: ./mainfolder/1001_name or ./mainfolder/1002_some_name

qwerty

1 Answer

After the following line:

df_temp = pd.read_csv(tar.extractfile(file))

You can get the folder name from the file path string using os.path.dirname(). This requires importing the os module.

Example:

import os

# returns ./mainfolder/1001_name
full_folder_path = os.path.dirname(file)

# returns 1001_name
folder = os.path.basename(full_folder_path)

# returns the part after the first underscore, e.g. name
result = folder[folder.index('_')+1:]

df_temp['FolderName'] = result

This creates a new column called FolderName and sets the same value for every row.
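Putting the pieces together, the whole loop might look like the sketch below. It is only an illustration under a few assumptions: the helper names read_csvs_with_folder and build_demo_archive are made up for this example, the demo archive and its CSV contents are invented so the snippet runs without a real file, and posixpath is swapped in for os.path because tar member names always use forward slashes. It also keeps the full folder name (1001_name), which is one reading of the question; apply the underscore slice shown above if only the trailing part is wanted.

```python
import io
import posixpath
import tarfile

import pandas as pd


def read_csvs_with_folder(tar, start_with="./mainfolder/"):
    """Read every CSV under start_with and tag its rows with the parent folder name."""
    frames = []
    for member in tar.getmembers():
        name = member.name
        if not (member.isfile() and name.endswith(".csv") and name.startswith(start_with)):
            continue
        df_temp = pd.read_csv(tar.extractfile(member))
        # Tar member names use '/', so posixpath splits them the same way on any OS.
        df_temp["FolderName"] = posixpath.basename(posixpath.dirname(name))
        frames.append(df_temp)
    return pd.concat(frames, ignore_index=True)


def build_demo_archive():
    """Build a small in-memory tar.gz with made-up folders and CSV contents."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w:gz") as tar:
        for path, text in [
            ("./mainfolder/1001_name/a.csv", "x\n1\n2\n"),
            ("./mainfolder/1002_some_name/b.csv", "x\n3\n"),
        ]:
            data = text.encode("utf-8")
            info = tarfile.TarInfo(name=path)
            info.size = len(data)
            tar.addfile(info, io.BytesIO(data))
    buf.seek(0)
    return buf


# With a real file, open it with tarfile.open(file_path, "r:*") instead.
with tarfile.open(fileobj=build_demo_archive(), mode="r:*") as tar:
    df = read_csvs_with_folder(tar)

print(df)
```

Passing ignore_index=True to pd.concat gives the combined frame a fresh row index, which is usually what you want when stacking files; drop it if the per-file indexes matter.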

Rithin Chalumuri