
This topic has been covered many times here, but none of the solutions have worked for me. I am trying to add all the text files from a folder into a dataframe. The code below works IF I only have one file in the folder, but as soon as I add another, I get the error: `ParserError: Error tokenizing data. C error: Expected 1 fields in line 40, saw 2.`

import pandas as pd 
import os
import glob

#define path to dir containing the summary text files
files_folder = "/data/TB/WA_dirty_prep_reports/"
files = []

#create a df list using list comprehension
files = [pd.read_csv(file) for file in glob.glob(os.path.join(files_folder,"*txt"))] 

#concatenate the list of df's into one df
files_df = pd.concat(files)


print(files_df)

I have also tried this approach with the same result:

import glob
import pandas as pd

path = '/data/TB/WA_dirty_prep_reports'
summary_files = glob.glob(path + "/*.txt")

df_list = []

df_list = (pd.read_csv(file) for file in summary_files)

big_df = pd.concat(df_list, ignore_index=True)
natural_d
  • Welcome to SO! Which line is then line 40 in your code? On the side, initializing `files = []` is useless. – OCa Aug 13 '23 at 21:16
  • Please print the output of your glob.glob. You're going to have to decompose your code step by step to pinpoint issues. – OCa Aug 13 '23 at 21:29
  • Your code works for me, with 3 dummy .txt files. Something to do with the content of your second file? Does `pd.read_csv` work for each file separately? – OCa Aug 13 '23 at 21:36
  • Does this answer your question? [Python Pandas Error tokenizing data](https://stackoverflow.com/questions/18039057/python-pandas-error-tokenizing-data) – OCa Aug 13 '23 at 21:38
  • Look for separator issue at line 40 **inside second txt file** – OCa Aug 13 '23 at 21:40
  • Hi and thank you--I printed the output of glob.glob and it lists the two text files in the folder. I am confused about 'line 40' because I don't have a line 40! – natural_d Aug 13 '23 at 21:42
  • Re: the link you shared. Another issue I am having is that I have tried adding sep='\t' parameter but get an error there as well. These text files actually contain multiple tables, is it possible that is a problem? It seems I would get an error with only one file in that case too – natural_d Aug 13 '23 at 21:44
  • oh, multiple tables. Yes, nice try, but read the [docs](https://pandas.pydata.org/docs/reference/api/pandas.read_csv.html) again. `pd.read_csv` expects ONE table. Rather look in the direction of https://stackoverflow.com/q/34184841/12846804 *Read csv file containing multiple tables*. It doesn't look like there's an easy way, but that was from 7 years ago. – OCa Aug 13 '23 at 22:01
  • You might want to edit your question title and body in light of what we've found out. Or rather, post a new one focusing solely on one txt file. – OCa Aug 13 '23 at 22:04
  • ah ok. thanks so much for your help and patience while I am really still new at this! I am looking into editing the title – natural_d Aug 13 '23 at 22:04
  • Actually since you've accepted my (partial) answer, editing this question is probably not going to get you a lot of new answers. People typically screen for questions without accepted answer. Rather post a new one, but focused on the isolated issue of multiple tables. – OCa Aug 13 '23 at 22:07
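Following the suggestion in the comments to test each file on its own, a minimal diagnostic sketch (using the same folder path as in the question; the error handling and print format are just illustrative) could look like this:

import glob
import os
import pandas as pd

files_folder = "/data/TB/WA_dirty_prep_reports/"

#read each file separately so a ParserError points at a specific file
for file in glob.glob(os.path.join(files_folder, "*.txt")):
    try:
        df = pd.read_csv(file)
        print(f"OK   {file}: {df.shape[0]} rows, {df.shape[1]} columns")
    except pd.errors.ParserError as exc:
        print(f"FAIL {file}: {exc}")

The "line 40" in the error message then refers to a line inside whichever file fails, not to a line of the script.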

1 Answer


It looks like some lines inside the txt files do not get parsed the same way as the other lines, maybe because of a separator issue. Try this:

files = [pd.read_csv(file, on_bad_lines='skip') for file in glob.glob(os.path.join(files_folder,"*txt"))] 

as suggested in [Python Pandas Error tokenizing data](https://stackoverflow.com/questions/18039057/python-pandas-error-tokenizing-data).

Of course that might not be satisfactory to you because you will be missing lines.

It does NOT look like the issue comes from having one, two, or more txt files: you probably get the same issue with your second file alone.
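If you would rather see which lines are being dropped than skip them silently, on_bad_lines also accepts 'warn' (available since pandas 1.3). A quick sketch against a single file; the filename below is hypothetical, substitute the file that raises the error:

import pandas as pd

#hypothetical filename: point this at the file that raises the ParserError
problem_file = "/data/TB/WA_dirty_prep_reports/second_report.txt"

#'warn' keeps the parsable lines and emits a warning naming each bad line,
#which helps locate "line 40" inside the file itself
df = pd.read_csv(problem_file, on_bad_lines="warn")
print(df.shape)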

OCa
  • thank you, the on_bad_lines='skip' parameter seems to be working. I was also able to add the parameter sep='\t' and it prints without error, but there is a lot of NaN. I think these files are problematic because they contain multiple tables – natural_d Aug 13 '23 at 21:58
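As the last comment notes, each report apparently contains several tables, while pd.read_csv expects exactly one. A rough sketch of one workaround, under the assumption that the tables inside a report are separated by blank lines and are tab-delimited (both are assumptions about the file layout, not something confirmed above):

import glob
import os
from io import StringIO

import pandas as pd

files_folder = "/data/TB/WA_dirty_prep_reports/"
tables = []

for file in glob.glob(os.path.join(files_folder, "*.txt")):
    with open(file) as fh:
        text = fh.read()
    #assumption: blank lines separate the tables inside a report
    for block in text.split("\n\n"):
        block = block.strip()
        if not block:
            continue
        #assumption: each table is tab-delimited; adjust sep to match the files
        tables.append(pd.read_csv(StringIO(block), sep="\t"))

#each element of tables is its own DataFrame; concatenate only the ones
#that share the same columns
print(len(tables))

Whether that split matches the real layout of the reports is better handled in a separate, focused question, as suggested in the comments.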