
I have imported a few thousand .txt files from a folder into a pandas DataFrame. Is there a way to add a column that holds a substring of each file's name, so that every text in the DataFrame can be identified by a unique name?

The text files are named 1001example.txt, 1002example.txt, 1003example.txt and so on. I want something like this:

filename        text
1001            this is an example text
1002            this is another example text
1003            this is the last example text
....

The code I have used to import the data is below. However, I do not know how to create a column by a sub-string of filenames. Any help would be appreciated. Thanks.

import glob
import os
import pandas as pd

file_list = glob.glob(os.path.join(os.getcwd(), "K:\\text_all", "*.txt"))

corpus = []

for file_path in file_list:
    with open(file_path, encoding="latin-1") as f_input:
        corpus.append(f_input.read())

df = pd.DataFrame({'text':corpus})
crackers
  • [pathlib](https://docs.python.org/3/library/pathlib.html) makes this easy with the [stem](https://docs.python.org/3/library/pathlib.html#pathlib.PurePath.stem) method. – sammywemmy Jul 14 '20 at 05:06
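
To make the pathlib suggestion concrete, here is a minimal sketch. It assumes the identifier is always the leading run of digits in the file name; `numeric_prefix` and `load_corpus` are hypothetical helper names, not part of pandas or pathlib.

```python
import re
from pathlib import Path

import pandas as pd


def numeric_prefix(path):
    """Return the leading run of digits in the file name stem, or None."""
    # Path.stem drops the directory and the final suffix: "1001example.txt" -> "1001example"
    match = re.match(r"\d+", Path(path).stem)
    return match.group(0) if match else None


def load_corpus(folder, pattern="*.txt", encoding="latin-1"):
    """Build one row per matching file: numeric prefix plus the full text."""
    rows = [
        {"filename": numeric_prefix(p), "text": p.read_text(encoding=encoding)}
        for p in sorted(Path(folder).glob(pattern))
    ]
    return pd.DataFrame(rows, columns=["filename", "text"])
```

With the folder from the question, this would be called as `df = load_corpus("K:\\text_all")`. The `filename` column holds strings, so a 3-digit and a 4-digit prefix are both kept as they appear.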

2 Answers


This should work. It keeps only the digits from each file name, however many there are.

import glob
import os
import pandas as pd

file_list = glob.glob(os.path.join(os.getcwd(), "K:\\text_all", "*.txt"))

corpus = []
files = []

for file_path in file_list:
    with open(file_path, encoding="latin-1") as f_input:
        corpus.append(f_input.read())
        files.append(''.join([n for n in os.path.basename(file_path) if n.isdigit()]))

df = pd.DataFrame({'file':files, 'text':corpus})
Tarun Pathak
  • Thank you for the answer. It is working. But I want only numeric values as filenames, which range from 3 to 4 digits (100, 1001, etc.). Currently, I get the file name with only 3 digits. Is there any way I could accommodate this in the code (both 3 and 4 digits)? – crackers Jul 14 '20 at 05:22
  • I have updated the code to pick numeric values from the file name. It will now pick up any length. – Tarun Pathak Jul 14 '20 at 08:12
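
One thing to keep in mind with the digit-joining approach above: it keeps every digit in the name, so a name with a second number in it would produce a merged value. The sketch below contrasts that with taking only the first run of digits. The helper names and the `1001example2.txt` file name are made up for illustration; the files in the question do not have a second number.

```python
import os
import re


def all_digits(file_path):
    """Join every digit found anywhere in the file name (the approach used above)."""
    return "".join(ch for ch in os.path.basename(file_path) if ch.isdigit())


def first_digit_run(file_path):
    """Return only the first consecutive run of digits, or '' if there is none."""
    match = re.search(r"\d+", os.path.basename(file_path))
    return match.group(0) if match else ""


print(all_digits("1001example2.txt"))       # 10012
print(first_digit_run("1001example2.txt"))  # 1001
```

For names like 100example.txt or 1001example.txt both helpers give the same result, so the choice only matters if extra digits can appear later in the name.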

There is a one-liner:

import glob
import os
import pandas as pd

df = pd.concat([pd.read_csv(f, encoding='latin-1').
                assign(Filename=os.path.basename(f)) for f in glob.glob('K:\\text_all\\*.txt')])
df['Filename'] = df['Filename'].str.extract(r'(\d+)', expand=False).astype(int)
David Erickson
  • aah I see you just want the number.... I will modify – David Erickson Jul 14 '20 at 05:25
  • Thanks for answering. Also I get this error with this code: `ParserError: Error tokenizing data. C error: Expected 1 fields in line 48, saw 2` – crackers Jul 14 '20 at 05:26
  • The error continues: `ParserError: Error tokenizing data. C error: Expected 1 fields in line 48, saw 2` – crackers Jul 14 '20 at 05:32
  • from some googling... it looks like you might want to pass `, error_bad_lines=False` in `read_csv` OR `, comment='#'` I'm sorry, but I don't know what the error is. It works for me. Here are some links: https://stackoverflow.com/questions/18039057/python-pandas-error-tokenizing-data or https://stackoverflow.com/questions/49632641/pandas-parsing-csv-error-expected-1-fields-found-9 – David Erickson Jul 14 '20 at 05:35
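
A likely cause of the `ParserError` is that `read_csv` treats each file as comma-separated data, so a line of free text containing a comma has more fields than the first line. Skipping bad lines (in recent pandas releases `error_bad_lines` has been replaced by `on_bad_lines`) would silently drop text. A possible alternative, sketched below, keeps the concat and `str.extract` idea but reads each file as one string. `read_text_files` is a hypothetical helper, and the sketch assumes every file name contains at least one digit.

```python
import glob
import os

import pandas as pd


def read_text_files(pattern, encoding="latin-1"):
    """Return one row per file: numeric ID from the file name plus the whole text."""
    frames = []
    for path in glob.glob(pattern):
        with open(path, encoding=encoding) as fh:
            # One-row frame per file; the text is never parsed as CSV.
            frames.append(
                pd.DataFrame({"Filename": [os.path.basename(path)], "text": [fh.read()]})
            )
    if not frames:
        # pd.concat raises on an empty list, so return an empty frame instead.
        return pd.DataFrame(columns=["Filename", "text"])
    df = pd.concat(frames, ignore_index=True)
    # First run of digits in the name; assumes each name has one (otherwise astype fails).
    df["Filename"] = df["Filename"].str.extract(r"(\d+)", expand=False).astype(int)
    return df
```

For the folder in the question the call would be `df = read_text_files('K:\\text_all\\*.txt')`.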