How can I add the filenames of imported txt files to a DataFrame in Python?

Question

I have imported a few thousand txt files from a folder into a pandas DataFrame. Is there any way I can create a column holding a sub-string of each imported file's name? This is to identify each text file in the DataFrame by a unique name.

Text files are named 1001example.txt, 1002example.txt, 1003example.txt, and so on. I want something like this:

filename        text
1001            this is an example text
1002            this is another example text
1003            this is the last example text
....

The code I have used to import the data is below. However, I do not know how to create a column by a sub-string of filenames. Any help would be appreciated. Thanks.

import glob
import os
import pandas as pd

file_list = glob.glob(os.path.join(os.getcwd(), "K:\\text_all", "*.txt"))

corpus = []

for file_path in file_list:
    with open(file_path, encoding="latin-1") as f_input:
        corpus.append(f_input.read())

df = pd.DataFrame({'text':corpus})

[pathlib](https://docs.python.org/3/library/pathlib.html) makes this easy with the [stem](https://docs.python.org/3/library/pathlib.html#pathlib.PurePath.stem) attribute. – sammywemmy, Jul 14 '20 at 05:06
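To illustrate the pathlib suggestion above, here is a minimal sketch of pulling the numeric prefix out of a file's stem. The helper name `leading_id` and the sample paths are assumptions made for illustration (they follow the naming pattern described in the question); `PureWindowsPath` is used only so the `K:\` style paths are parsed the same way on any operating system.

```python
import re
from pathlib import PureWindowsPath

# Hypothetical paths following the naming pattern from the question.
paths = [
    "K:\\text_all\\1001example.txt",
    "K:\\text_all\\1002example.txt",
    "K:\\text_all\\100example.txt",
]

def leading_id(path):
    """Return the run of digits at the start of the file's stem, or None."""
    stem = PureWindowsPath(path).stem   # "1001example.txt" -> "1001example"
    match = re.match(r"\d+", stem)      # digits at the very start of the stem
    return match.group() if match else None

ids = [leading_id(p) for p in paths]
print(ids)  # ['1001', '1002', '100']
```

On a real Windows machine, `pathlib.Path` can be used in place of `PureWindowsPath`, and the resulting list can be passed as a column when building the DataFrame.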

Tarun Pathak · Accepted Answer · 2020-07-14T08:12:05.073

2

This should work. It keeps only the digit characters from each file name, so it handles numbers of any length.

import glob
import os
import pandas as pd

file_list = glob.glob(os.path.join("K:\\text_all", "*.txt"))

corpus = []
files = []

for file_path in file_list:
    with open(file_path, encoding="latin-1") as f_input:
        corpus.append(f_input.read())
        files.append(''.join([n for n in os.path.basename(file_path) if n.isdigit()]))

df = pd.DataFrame({'file':files, 'text':corpus})

edited Jul 14 '20 at 08:12

answered Jul 14 '20 at 05:12


Thank you for the answer. It is working. But I want only numeric values as filenames, which range from 3 to 4 digits (100, 1001, etc.). Currently, I get the file name with only 3 digits. Is there any way I could accommodate this in the code (both 3 and 4 digits)? – crackers Jul 14 '20 at 05:22
I have updated the code to pick numeric values from the file name. It will now pick up any length. – Tarun Pathak Jul 14 '20 at 08:12
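As a small check of the "any length" behaviour discussed in these comments, the sketch below compares the digit-filtering approach from the answer with a regex that takes only the leading run of digits. The function names and the third sample file name are hypothetical, added to show where the two approaches differ.

```python
import re

def all_digits(name):
    """Keep every digit character in the name (the approach in the answer)."""
    return "".join(ch for ch in name if ch.isdigit())

def leading_digits(name):
    """Keep only the digits at the start of the name."""
    match = re.match(r"\d+", name)
    return match.group() if match else ""

# The last name is a made-up edge case with a digit after the prefix.
samples = ["100example.txt", "1001example.txt", "1001example2.txt"]
results = {name: (all_digits(name), leading_digits(name)) for name in samples}

for name, (every, leading) in results.items():
    print(name, every, leading)
```

Both functions agree on names like 100example.txt and 1001example.txt. They differ only when digits also appear later in the name, where filtering all digits would merge them into the identifier.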

David Erickson · Answer 2 · 2020-07-14T05:29:40.117

1

There is a one-liner:

import glob
import os
import pandas as pd

df = pd.concat([pd.read_csv(f, encoding='latin-1').
                assign(Filename=os.path.basename(f)) for f in glob.glob('K:\\text_all\\*.txt')])
df['Filename'] = df['Filename'].str.extract(r'(\d+)', expand=False).astype(int)

edited Jul 14 '20 at 05:29

answered Jul 14 '20 at 05:22


aah I see you just want the number.... I will modify – David Erickson Jul 14 '20 at 05:25
Thanks for answering. Also I get this error with this code: `ParserError: Error tokenizing data. C error: Expected 1 fields in line 48, saw 2` – crackers Jul 14 '20 at 05:26
The error continues: `ParserError: Error tokenizing data. C error: Expected 1 fields in line 48, saw 2` – crackers Jul 14 '20 at 05:32
from some googling... it looks like you might want to pass `, error_bad_lines=False` in `read_csv` OR `, comment='#'` I'm sorry, but I don't know what the error is. It works for me. Here are some links: https://stackoverflow.com/questions/18039057/python-pandas-error-tokenizing-data or https://stackoverflow.com/questions/49632641/pandas-parsing-csv-error-expected-1-fields-found-9 – David Erickson Jul 14 '20 at 05:35
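One likely cause of the `ParserError` above is that `read_csv` treats commas inside free text as column separators, so a line containing a comma appears to have two fields. A possible way around it, sketched below, is to skip CSV parsing and read each file as a single string. The function name `load_corpus` and the temporary demo files are assumptions for illustration, not part of either answer. Note also that `error_bad_lines` was later replaced by `on_bad_lines` in newer pandas releases.

```python
import re
import tempfile
from pathlib import Path

import pandas as pd

def load_corpus(folder, encoding="latin-1"):
    """Read every .txt file in `folder` as one string per row."""
    rows = []
    for path in sorted(Path(folder).glob("*.txt")):
        match = re.match(r"\d+", path.stem)  # leading digits of the file name
        rows.append({
            "filename": match.group() if match else path.stem,
            "text": path.read_text(encoding=encoding),
        })
    return pd.DataFrame(rows, columns=["filename", "text"])

# Demo on throwaway files; the contents are made up and include commas.
with tempfile.TemporaryDirectory() as tmp:
    Path(tmp, "1001example.txt").write_text(
        "this is an example, with a comma", encoding="latin-1")
    Path(tmp, "100example.txt").write_text(
        "line one\nline two, three", encoding="latin-1")
    demo = load_corpus(tmp)

print(demo)
```

For the folder in the question, the call would be `load_corpus("K:\\text_all")`. Because the text is never tokenized, commas and line breaks inside a file stay part of that file's single `text` value.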
