
I am reading pipe-delimited data from text files. There are some parsing issues, which I am handling with `pd.read_csv(..., error_bad_lines=False)`:

files = [os.path.join(filepath, f) for f in os.listdir(filepath)]
df_f = []
for i in files:
    df = pd.read_csv(i, usecols=col_lst, sep='|', engine='python',
                     encoding='iso-8859-1', error_bad_lines=False)
    df_f.append(df)

The above method drops the bad lines caused by `|` parsing issues and carries on.

Objective: can I get the warning messages for the bad lines in the above example and collect them in a list?

Eg.

df_f =[]
bad_line =[]
for i in files:
    df = pd.read_csv(i, usecols=col_lst, sep='|', engine='python',
                     encoding='iso-8859-1', error_bad_lines=False)
    # Pseudocode below. Need assistance in building it correctly.
    if bad_lines:
        bad_line.append(bad_lines)
    df_f.append(df)

In other words, how can I append the warning messages to the `bad_line` list?

Any thoughts on this would be appreciated.

pythondumb

1 Answer


Do the same while redirecting the errors to a log file. I replaced `os` with `pathlib` because it is more readable, and set `warn_bad_lines=True`; that was it.

from pathlib import Path
import contextlib

import pandas as pd

# variables: replace with real ones
CSVS_DIR = './data'
LOG_DIR = './logs'
COL_LIST = ['your_list', '...'] 

# create log dir if not exist
Path(LOG_DIR).mkdir(parents=True, exist_ok=True)

# redirect stderr (where the warnings go) to log.txt
with open(Path(LOG_DIR) / 'log.txt', 'w') as f:
    with contextlib.redirect_stderr(f):

        dfs_list = [
            pd.read_csv(csv_file, usecols=COL_LIST, sep='|', engine='python',
                        encoding='iso-8859-1', error_bad_lines=False,
                        warn_bad_lines=True)
            for csv_file in Path(CSVS_DIR).glob('*.csv')
        ]

        df_master = pd.concat(dfs_list)
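
The log can then be read back into exactly the `bad_line` list the question asks for. A minimal sketch; the `Skipping line ...` message text is only illustrative, and a temp directory stands in for `LOG_DIR`:

```python
from pathlib import Path
import tempfile

# Stand-in for LOG_DIR; in the real flow this is the directory used above.
log_dir = Path(tempfile.mkdtemp())
# Illustrative content: pandas writes one "Skipping line ..." message per bad line.
(log_dir / 'log.txt').write_text('Skipping line 3: expected 3 fields, saw 4\n')

# One list entry per non-empty line of the log
with open(log_dir / 'log.txt') as f:
    bad_line = [line.rstrip('\n') for line in f if line.strip()]
```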

If we don't want log files, we can use the warnings library:

import warnings
from pathlib import Path

import pandas as pd

# variables: replace with real ones
CSVS_DIR = './data'
COL_LIST = ['your_list', '...'] 

# record warnings in a variable:

with warnings.catch_warnings(record=True) as w:

    dfs_list = [
        pd.read_csv(csv_file, usecols=COL_LIST, sep='|', engine='python',
                    encoding='iso-8859-1', error_bad_lines=False,
                    warn_bad_lines=True)
        for csv_file in Path(CSVS_DIR).glob('*.csv')
    ]

    df_master = pd.concat(dfs_list)
    df_bad_lines_list = [str(bad.message) for bad in w]
Prayson W. Daniel
  • I do not want to create any log directory. Instead I want to capture the warning message in a list. So in your code, if I make some modifications (as in the commented lines), I hope it should work. – pythondumb Nov 23 '20 at 07:50
  • Ah okay. I usually read the log file, afterwards. Let me edit a way to add it as list without log file – Prayson W. Daniel Nov 23 '20 at 08:05
  • Actually, I shall INSERT the bad_line string to postgresql as an Entry. However, I need to figure out for a single `.csv` file there could be multiple bad_lines. Hence, to ask you, `df_bad_lines_list` is it a dataframe? And `bad.message` comes from `warning.catch_warning() `? – pythondumb Nov 23 '20 at 08:20
  • Ah! Now we will have to write our flow a bit different to include details. This is why I usually opt for log files, and create a separate task to update my database on progress, failures, etc from logs. The log file or the message includes the file name and lines causing issues, no? – Prayson W. Daniel Nov 23 '20 at 09:24
  • Yes. besides, `df_bad_lines_list` is creating a blank list despite bad lines. – pythondumb Nov 23 '20 at 09:32
  • I will try to recreate your issue on my side – Prayson W. Daniel Nov 23 '20 at 09:48
  • I have used the `f_in_m = io.StringIO() bad_lines = [] with redirect_stderr(f_in_m):` method and it is working as expected. – pythondumb Nov 23 '20 at 11:22
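
The in-memory variant from the last comment can be sketched as below, again with hypothetical inline data. `on_bad_lines='warn'` is the pandas >= 1.3 spelling (older pandas takes `error_bad_lines=False, warn_bad_lines=True` instead); either way the message text reaches stderr, either directly or via the default warning printer:

```python
import io
from contextlib import redirect_stderr

import pandas as pd

# Hypothetical sample: the third line has an extra '|' field.
raw = 'a|b|c\n1|2|3\n4|5|6|7\n8|9|10\n'

f_in_m = io.StringIO()                       # in-memory buffer instead of log.txt
with redirect_stderr(f_in_m):
    df = pd.read_csv(io.StringIO(raw), sep='|', engine='python',
                     on_bad_lines='warn')

# One entry per non-empty captured line; ready to INSERT into postgresql
bad_lines = [m for m in f_in_m.getvalue().splitlines() if m.strip()]
```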