
Not sure if this is considered a double post; I apologize if it is.

I am having a problem with pandas' error_bad_lines. I am importing around 500 CSV files into a DataFrame and using error_bad_lines to remove lines with bad data. This works until it reads a CSV file that has bad data on line 2 of the file, which causes the script to crash. The bad line can be anywhere else and the code works, but for some reason a bad line at line 2 makes the program crash.

I got around the problem with help from a member, who provided an example of reading line by line. The issue with that approach is that it takes forever to process (about 45 minutes to an hour to read and output).

I am not sure if there is a fix for error_bad_lines ignoring the bad data at line 2; I would appreciate it if someone knows how to fix that.

In the meantime, what I am attempting to do is read the first few lines of each CSV to check whether any are bad lines; if a line is bad, ignore it and continue reading the rest of the file with the concat function. How would I include such a statement in the pd.concat line, or is that not possible?

import glob

import pandas as pd

label1 = "DATE"
label2 = "TIME"
label3 = "In1 "
label4 = "In2 "
header_label = [label1, label2, label3, label4]
df = pd.concat([pd.read_csv(f, sep='  | |,', names=header_label, engine='python',
                            warn_bad_lines=True, error_bad_lines=False)
                for f in glob.glob("input/*.csv")],
               ignore_index=False)
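The pre-filtering idea described above can be sketched like this: read each file, keep only the lines that split into the expected number of fields, and hand the cleaned text to read_csv via StringIO. This is only a sketch under the assumption that every bad line has extra fields; `clean_lines` is a made-up helper, not a pandas feature.

```python
import glob
from io import StringIO

import pandas as pd

labels = ["DATE", "TIME", "In1", "In2"]


def clean_lines(lines, n_fields):
    # keep only lines that split into exactly n_fields tokens
    # (commas and runs of spaces both count as separators here)
    good = [ln for ln in lines if len(ln.replace(",", " ").split()) == n_fields]
    return StringIO("".join(good))


frames = []
for path in glob.glob("input/*.csv"):
    with open(path) as fh:
        cleaned = clean_lines(fh, len(labels))
    frames.append(pd.read_csv(cleaned, sep='  | |,', names=labels,
                              engine='python', index_col=False))
if frames:
    df = pd.concat(frames, ignore_index=True)
```

This trades the line-by-line pandas calls for a cheap plain-Python pass per file, so it should stay far faster than reading each row into its own DataFrame.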

Input File:

2019/11/13 08:32  10.4,20.4
2019/11/13 13:58  752019/11/13 13:58  .4,123.9
2019/11/13 09:11  10.2,18.5
2019/11/13 09:22  9.5,14.8
2019/11/13 09:23  9.5,14.5
2019/11/13 09:24  9.3,14.6
2019/11/13 09:25  9.5,14.2
2019/11/13 09:26  9.0,13.4
2019/11/13 09:27  9.7,14.9

Current Output:

                          DATE  ...                  In1
2019/11/13 08:32          10.4  ...                  NaN
           13:58  752019/11/13  ...                123.9
           09:09  752019/11/13  ...                123.9
           09:11          10.2  ...                  NaN
           09:22           9.5  ...                  NaN
...                        ...  ...                  ...

Desired Output

           DATE   TIME                   In1                  In2 
0    2019/11/13  08:32                  10.4                  20.4
1    2019/11/13  09:11                  10.2                  18.5
2    2019/11/13  09:22                   9.5                  14.8
3    2019/11/13  09:23                   9.5                  14.5
4    2019/11/13  09:24                   9.3                  14.6
..          ...    ...                   ...                   ...

I am new to programming and not looking for anyone to write code out for me, just to point me in the right direction. Thank you


1 Answer


Adding index_col=False to your invocation of pd.read_csv gets you closer to the result you want. Without it Pandas tries to use your first two columns as indices. Like you, I'm not sure why error_bad_lines=False does not delete the second line.

from io import StringIO
import pandas as pd


f = StringIO("""2019/11/13 08:32  10.4,20.4
2019/11/13 13:58  752019/11/13 13:58  .4,123.9
2019/11/13 09:11  10.2,18.5
2019/11/13 09:22  9.5,14.8
2019/11/13 09:23  9.5,14.5
2019/11/13 09:24  9.3,14.6
2019/11/13 09:25  9.5,14.2
2019/11/13 09:26  9.0,13.4
2019/11/13 09:27  9.7,14.9""")

label1="DATE"
label2="TIME"
label3="In1 "
label4="In2 "
header_label = [label1, label2, label3, label4]
df = pd.read_csv(f, sep='  | |,', names=header_label, engine='python',
                 warn_bad_lines=True, error_bad_lines=False, index_col=False)

print(df)
         DATE   TIME          In1    In2 
0  2019/11/13  08:32          10.4   20.4
1  2019/11/13  13:58  752019/11/13  13:58
2  2019/11/13  09:11          10.2   18.5
3  2019/11/13  09:22           9.5   14.8
4  2019/11/13  09:23           9.5   14.5
5  2019/11/13  09:24           9.3   14.6
6  2019/11/13  09:25           9.5   14.2
7  2019/11/13  09:26           9.0   13.4
8  2019/11/13  09:27           9.7   14.9
  • Thank you, that got me much closer to the output. Would you happen to know if there is a method to compare the length of each row in a DataFrame and delete the row if it is longer than a certain value? – zeroz Aug 17 '20 at 21:11
  • This answer seems relevant to that: https://stackoverflow.com/a/34286313/1072246 – rcriii Aug 17 '20 at 22:06
  • Thank you @rcriii, I was able to solve the problem by using your advice and creating my own functions to check each column. – zeroz Sep 02 '20 at 23:11
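For later readers: the row-length check discussed in these comments can be sketched with pandas string methods. The DataFrame, column names, and threshold below are illustrative, not from the original code:

```python
import pandas as pd

df = pd.DataFrame({
    "In1": ["10.4", "752019/11/13", "10.2"],  # second row carries a malformed value
    "In2": ["20.4", "123.9", "18.5"],
})

# drop any row where some value, viewed as a string, exceeds max_len characters
max_len = 6
mask = df.astype(str).apply(lambda col: col.str.len() <= max_len).all(axis=1)
clean = df[mask].reset_index(drop=True)
```

`astype(str)` makes the length test uniform across column dtypes, and `.all(axis=1)` keeps only rows where every column passes.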