
I'm trying to gather multiple csv files from one folder into a dataframe. In a prior question we realized the real issue is that some csv files (summary files) contain more than one table. As a result, the current solution (code below) skips a significant portion of the data.

Is there any reasonable way to gather multiple files, each possibly containing multiple tables?

Alternatively, if this makes it easier, I have, and could use, separate text files for each of the tables contained in the larger summary files.

In any case, what I'm after is that a single row of the generated dataframe should contain the data from the three separate text files / the three tables inside the summary file.

Here is my code for just adding the text files from their folder.

import pandas as pd 
import os
import glob

#define path to dir containing the summary text files
files_folder = "/data/TB/WA_dirty_prep_reports/"

#create a df list using list comprehension
files = [pd.read_csv(file, sep='\t', on_bad_lines='skip') for file in glob.glob(os.path.join(files_folder, "*.txt"))]

#concatenate the list of df's into one df
files_df = pd.concat(files)


print(files_df)
  • You may start here for single file, multiple tables: [Pandas: read_csv (read multiple tables in a single file)](https://stackoverflow.com/q/36846090/12846804) – OCa Aug 15 '23 at 15:21
  • This is another one on the topic of single file, multiple tables: [Python Pandas - Read csv file containing multiple tables](https://stackoverflow.com/q/34184841/12846804). Make it work with one file and we can solve the final concatenation later. If you seek more than pointers but specific advice, you're going to have to share one file, or a mock-up of it, as a minimal reproducible example. See e.g. [How to make good reproducible pandas examples](https://stackoverflow.com/questions/20109391/how-to-make-good-reproducible-pandas-examples) – OCa Aug 15 '23 at 15:28
  • I strongly encourage you to post some input data if you expect accurate solution. – OCa Aug 22 '23 at 07:11

2 Answers


It seems like you're trying to read multiple text files from a folder and concatenate them into a single DataFrame. However, if the text files contain multiple tables and you're seeing unexpected results, it might be due to the way you're reading and concatenating them. Each table within a text file might have different structures, making direct concatenation problematic.

Here's a rough outline of how you might modify your code to achieve this:

import pandas as pd 
import os
import glob

#define path to dir containing the summary text files
files_folder = "/data/TB/WA_dirty_prep_reports/"

# Define a function to process a text file and extract tables
def process_text_file(file_path):
    # Implement logic to extract tables from the file and convert to DataFrames
    # Return a list of DataFrames
    raise NotImplementedError  # placeholder: depends on your file structure
# Create an empty list to store DataFrames
all_dfs = []

# Loop through the text files and process each
for file_path in glob.glob(os.path.join(files_folder, "*.txt")):
    dataframes_from_file = process_text_file(file_path)
    all_dfs.extend(dataframes_from_file)

# Concatenate all DataFrames into a single DataFrame
final_df = pd.concat(all_dfs)

# Print the final DataFrame
print(final_df)

In the process_text_file function, you would implement the logic to extract tables from a given text file and convert them into separate DataFrames. You might need to use regular expressions, string manipulation, or other techniques to achieve this, depending on the structure of your text files.

Keep in mind that the exact implementation of the process_text_file function would depend on the structure and formatting of your text files. If the tables within the text files have a consistent structure, you should be able to extract the necessary data. If the structure varies, the extraction process might be more complex and might require custom parsing logic.
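For illustration only, here is one minimal sketch of what `process_text_file` could look like, assuming the tables inside a summary file are tab-separated and separated from each other by blank lines (both assumptions you'd have to adjust to your actual files):

```python
import io

import pandas as pd


def process_text_file(file_path):
    """Split a summary file into DataFrames, assuming each table is
    tab-separated and tables are separated by blank lines."""
    with open(file_path) as f:
        content = f.read()
    dfs = []
    # Treat each blank-line-separated chunk as one table
    for chunk in content.split("\n\n"):
        chunk = chunk.strip()
        if chunk:
            dfs.append(pd.read_csv(io.StringIO(chunk), sep="\t"))
    return dfs
```

If your tables are delimited by marker rows instead of blank lines, you'd split on those markers instead, as described in the other answer.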

  • thank you very much! I am looking into how to use the process_text_file function to work with the files I have. I also have the ability to access each of the tables contained in one file in three separate files. would it be more straightforward to read in each of these three files into a data frame (would also need to select specific columns of data, not the entire table in this case) – natural_d Aug 14 '23 at 18:14
  • Don’t be so thankful - this is a chatgpt-generated answer. – David Makogon Aug 15 '23 at 05:37
  • Indeed, it's ALL about the mysterious content of this function! ;p – OCa Aug 15 '23 at 15:27
  • haha ok. this makes sense as I am still struggling to solve this mysterious content – natural_d Aug 15 '23 at 18:39

1) Method for single file containing multiple dataframes

Based on Python Pandas - Read csv file containing multiple tables

  1. Force reading the file with an excess of columns, so the dataframe contains all lines. Read with columns in excess (here 10):

df_read = pd.read_csv(your_file, header=None, names=range(10))

  2. Detect the table markers (with this method you have to know or expect those): flag tables by recognizing their top-left cells. "your table marker", "another table marker" might be the names of your first column.

table_names = ["table1", "table2"]
df_read['group'] = df_read[0].isin(table_names).cumsum()

  3. Split the tables on these markers using groupby, and reference them into a dictionary:

tables = {g.iloc[0,0]: g.iloc[1:] for k, g in df_read.groupby(df_read['group'])}

# Here, clean up the separate tables from the generated 'tables' dictionary

  4. Provided it makes sense to do it, you may concatenate them:

pd.concat(tables, axis=0)

You'd have to post a minimal example to get more specific advice. How does that look on your side?
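In the meantime, here is a self-contained run of these steps on a mock two-table file; the markers `table1`/`table2` and the tab separator stand in for whatever your files actually use:

```python
import io

import pandas as pd

# Mock file content: two stacked tables flagged by markers in column 0
raw = (
    "table1\t\t\n"
    "x\t1\t2\n"
    "y\t3\t4\n"
    "table2\t\t\n"
    "z\t5\t6\n"
)

# 1. Read with columns in excess (here 10) so no line is dropped
df_read = pd.read_csv(io.StringIO(raw), sep="\t", header=None, names=range(10))

# 2. Flag each table by recognizing its marker in the first column
table_names = ["table1", "table2"]
df_read["group"] = df_read[0].isin(table_names).cumsum()

# 3. Collect the tables into a dictionary keyed by their marker row
tables = {g.iloc[0, 0]: g.iloc[1:] for _, g in df_read.groupby("group")}
```

After this, `tables["table1"]` holds the two rows under the first marker and `tables["table2"]` the one row under the second, ready for per-table clean-up.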

2) Next, gather multiple files

As in Adding multiple text files into a Pandas dataframe ParserError

And Speed Up Concatenating Bin Files with Pandas and Exporting

  1. Define a function dealing with a single file
  2. Comprehend and concatenate the payload
def file_to_df(file, force_max_columns):
    '''convert one file into one temporary dataframe'''
    # insert the above lines #

# Comprehension then concatenation
files = glob.glob(os.path.join(files_folder, "*.txt"))
force_max_columns = 99
df = pd.concat([file_to_df(file, force_max_columns) for file in files])

# post-processing e.g. drop void excess columns

Almost there. Let us know, but do post a minimal example in case of difficulty. You're really going to need to show some file content.
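Putting parts 1) and 2) together, a runnable sketch could look like this; the folder path comes from the question, and the table markers are placeholders you'd replace with your own:

```python
import glob
import os

import pandas as pd


def file_to_df(file, force_max_columns, table_names):
    """Convert one multi-table file into one temporary dataframe,
    using the excess-columns / marker / groupby steps from part 1)."""
    df_read = pd.read_csv(file, sep="\t", header=None,
                          names=range(force_max_columns))
    df_read["group"] = df_read[0].isin(table_names).cumsum()
    tables = {g.iloc[0, 0]: g.iloc[1:] for _, g in df_read.groupby("group")}
    return pd.concat(tables, axis=0)


files_folder = "/data/TB/WA_dirty_prep_reports/"  # path from the question
files = glob.glob(os.path.join(files_folder, "*.txt"))
force_max_columns = 99
table_names = ["table1", "table2"]  # placeholder markers: use your own
if files:  # guard: the folder may not exist on another machine
    df = pd.concat([file_to_df(f, force_max_columns, table_names)
                    for f in files])
    # post-processing, e.g. drop the all-NaN excess columns
    df = df.dropna(axis=1, how="all")
```

The concatenation keys each block by its table marker, so the resulting dataframe has a MultiIndex you can use to pull individual tables back out.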
