Read multiple files in python and combine filenames and content into a dataframe

Question

I have the following list in Python, created by reading files:

files_list = ["A", "B", "C", "D"]

The contents of the files are character vectors as follows

A = ["A1"]
B = ["A2", "B1"]
C = ["A3", "B3", "C3", "C3"]
D = []

I would like to create the following dataframe

Col1   Col2
A      A1
B      A2, B1
C      A3, B3, C3
D

The filenames should be rendered as one column and the second column should contain the content of the files as a single line.

I tried the following code using a for loop. Note that this is a toy dataset; my real dataset is larger.

import pandas as pd


df3 = pd.DataFrame()
for i in list_name:
    for j in i:
        df3["Col1"] = j
        df3["Col2"] = i

How do I accomplish this using a for loop? Could someone take a look? The df3 object I generated was empty.

Can you look into this and see if this addresses your [question](https://stackoverflow.com/questions/20908018/import-multiple-excel-files-into-python-pandas-and-concatenate-them-into-one-dat) — Joe Ferndz, Sep 08 '20 at 05:51
You can also look into this for [reading files](https://stackoverflow.com/questions/46224610/how-to-read-multiple-txt-file-from-a-single-folder-in-python) or [this](https://stackoverflow.com/questions/57111243/how-to-read-multiple-text-files-in-a-folder-with-python) and please make sure your question is not a [duplicate](https://stackoverflow.com/help/duplicates) — Joe Ferndz, Sep 08 '20 at 05:53

Adirio · Accepted Answer · 2020-09-08T06:20:54.667

import pandas as pd


files_list = ["A", "B", "C", "D"]
files_cont = [
    ["A1"],
    ["A2", "B1"],
    ["A3", "B3", "C3", "C3"],
    [],
]

df3 = pd.DataFrame({"contents": list(map(sorted, map(set, files_cont)))}, index=files_list)
print(df3)

       contents
A          [A1]
B      [A2, B1]
C  [A3, B3, C3]
D            []

We create a new pd.DataFrame from a dict, so that each key becomes a column name (I used "contents", but choose whatever you like), and pass the index keyword argument to label the rows with the file names.

Because the expected output in the question drops the duplicated entry, each content list is first passed to the set function to eliminate duplicates, then to the sorted function to get back a list with sorted elements. If you don't need that, just use {"contents": files_cont} instead.
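The question asked for a for loop and for the contents rendered as a single line, so here is a minimal sketch of that variant. It assumes the same toy lists as above; the column names "Col1" and "Col2" are taken from the question, and joining with ", " is my reading of the desired output rather than something the question states explicitly.

```python
import pandas as pd

files_list = ["A", "B", "C", "D"]
files_cont = [["A1"], ["A2", "B1"], ["A3", "B3", "C3", "C3"], []]

# Collect one dict per file, then build the frame once at the end.
rows = []
for name, content in zip(files_list, files_cont):
    # Drop duplicates, sort, and join into a single string per file.
    unique_sorted = sorted(set(content))
    rows.append({"Col1": name, "Col2": ", ".join(unique_sorted)})

df3 = pd.DataFrame(rows, columns=["Col1", "Col2"])
print(df3)
```

Assigning to df3["Col1"] inside the loop, as in the question, overwrites the whole column on every iteration instead of adding a row, which is why collecting rows in a list first tends to work better.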

score 2 · Answer 2 · answered Sep 08 '20 at 07:03

Assuming your files are CSVs, you can use a for loop as follows:

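import glob
import os
import pandas as pd

directory = "C:/your/path/to/all/files/*.csv"
rows = []

# Sort the matches so the rows come out in a predictable order.
for file in sorted(glob.glob(directory)):
    # Use the file name without its extension as the label.
    col = os.path.splitext(os.path.basename(file))[0]
    try:
        temp = pd.read_csv(file, header=None).values.flatten()
    except pd.errors.EmptyDataError:
        # An empty file has no columns to parse, so store None.
        temp = None
    rows.append({"col": col, "contents": temp})

# DataFrame.append was removed in pandas 2.0, so build the frame in one step.
df3 = pd.DataFrame(rows, columns=["col", "contents"])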

You get the following DataFrame:

    col contents
0   A   [A1]
1   B   [A2, B1]
2   C   [A3, B3, C3]
3   D   None
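To try the loop without touching real data, here is a minimal self-contained sketch. The helper name collect_csv_contents and the sample file contents are hypothetical, chosen to mirror the toy data in the question; it writes them to a temporary directory, reads them back, and joins each file's values into one string.

```python
import tempfile
from pathlib import Path

import pandas as pd


def collect_csv_contents(folder):
    """Return one row per CSV file: file stem and its values joined by ', '."""
    rows = []
    for path in sorted(Path(folder).glob("*.csv")):
        try:
            values = pd.read_csv(path, header=None).values.flatten()
            contents = ", ".join(str(v) for v in values)
        except pd.errors.EmptyDataError:
            # Empty files raise EmptyDataError; record them as an empty string.
            contents = ""
        rows.append({"col": path.stem, "contents": contents})
    return pd.DataFrame(rows, columns=["col", "contents"])


with tempfile.TemporaryDirectory() as tmp:
    # Hypothetical sample files standing in for the real ones.
    samples = {"A": "A1\n", "B": "A2,B1\n", "C": "A3,B3,C3\n", "D": ""}
    for name, text in samples.items():
        Path(tmp, f"{name}.csv").write_text(text)
    demo = collect_csv_contents(tmp)

print(demo)
```

Storing an empty string instead of None for the empty file is a design choice here so that the column stays a plain string column; use None if you need to tell empty files apart from missing data.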
