
Is there a way to use a list comprehension to create a list of tuples with two different conditions?

I am iterating through a Pandas DataFrame and I want to return an entire row as a tuple if it matches either condition. The first is if the row has NaN values in any column. The other is if the column called ODFS_FILE_CREATE_DATETIME doesn't match the regex pattern for the date. The date column is supposed to hold a value that looks like this: 2005242132, i.e. exactly 10 digits. So if the column contains something like 2004dg, it should be picked up as an error and the row should be added to my list of tuples.

My sad pathetic attempt:

[tuple(x) for x in odfscsv_df[odfscsv_df.isna().any(1)].values or x in odfscdate_re.search(str(odfscsv_df['ODFS_FILE_CREATE_DATETIME'])) ]

Full function that contains the two separate lists of tuples:

def process_csv_formatting(csv):
    odfscsv_df = pd.read_csv(csv, header=None,names=['ODFS_LOG_FILENAME', 'ODFS_FILE_CREATE_DATETIME', 'LOT', 'TESTER', 'WAFER_SCRIBE'])
    odfscsv_df['CSV_FILENAME'] = csv.name
    odfscdate_re = re.compile(r"\d{10}")
    #print(odfscsv_df)
    #odfscsv_df = odfscsv_df.replace('', np.nan)
    errortup = [(odfsname, "Bad_ODFS_FILE_CREATE_DATETIME= " + str(cdatetime), csv.name) for odfsname,cdatetime in zip(odfscsv_df['ODFS_LOG_FILENAME'], odfscsv_df['ODFS_FILE_CREATE_DATETIME']) if not odfscdate_re.search(str(cdatetime))]
    emptypdf = pd.DataFrame(columns=['ODFS_LOG_FILENAME', 'ODFS_FILE_CREATE_DATETIME', 'LOT', 'TESTER', 'WAFER_SCRIBE'])
 
    print([tuple(x) for x in odfscsv_df[odfscsv_df.isna().any(1)].values])

    [tuple(x) for x in odfscsv_df[odfscsv_df.isna().any(1)].values or x in odfscdate_re.search(str(odfscsv_df['ODFS_FILE_CREATE_DATETIME'])) ]
    #print(odfscsv_df[(odfscsv_df[column_name].notnull()) & (odfscsv_df[column_name] != u'')].index)
    for index, row in odfscsv_df.iterrows():
        #print((row['WAFER_SCRIBE']))
        print((row['ODFS_FILE_CREATE_DATETIME']))
    #errortup = [x for x in odfscsv_df['ODFS_FILE_CREATE_DATETIME']]
    if len(errortup) != 0:
        #print(errortup)  #put this in log file statement somehow
        #print(errortup[0][2])
        return emptypdf
    else:

        return odfscsv_df

Sample CSV data. Commas delimit the cells:

2005091432_943SK1J.00J.SK1J-23.FPD.FMGN520.Jx6D36ny5EO53qAtX4.log,,W943SK10,MGN520,0Z0RK072TCD2
2005230137_014SF1J.00J.SF1J-23.WCPC.FMGN520.XlwHcgyP5eFCpZm5cf.log,,W014SF10,MGN520,DM4MU129SEC1
2005240909_001914J.E0J.914J-15.WRO3PC.FMGN520.nZKn7OvjGKw1i4pxiu.log,,K001914E,MGN520,DM3FZ226SEE3
2005242132_001914J.E0J.914J-15.WRO4PC.FMGN520.V8dcLhEgygRj2rP2Df.log,2005242132,K001914E,MGN520,DM3FZ226SEE3
2005251037_001914J.E0J.914J-15.WRO4PC.FMGN520.dyixmQ5r4SvbDFkivY.log,2005251037,K001914E,MGN520,DM3FZ226SEE3
2005251215_949949J.E0J.949J-21.WRO2PP.FMGN520.yp1i4e7a7D1ighkdB7.log,2005251215,K949949E,MGN520,DG2KV122SEF6
2005251231_949949J.E0J.949J-25.WRO2PP.FMGN520.oLQGhc2whAlhC3dSuR.log,2005251231,K949949E,MGN520,DG2KV333SEF3
2005260105_001914J.E0J.914J-15.WRO4PC.FMGN520.wOQMUOfZgkQK9iHJS5.log,2005260105,K001914E,MGN520,DM3FZ226SEE3
2006111130_950909J.00J.909J-22.FPC.FMGN520.UuqeGtw9xP6lLDUW9N.log,2006111130,K9509090,MGN520,DG7LW031SEE7
2006111612_950909J.00J.909J-22.FPC.FMGN520.hoDl3QSNPKhcs4oA2N.log,2006111612,K9509090,MGN520,DG7LW031SEE7
2006120638_006914J.E0J.914J-15.CZPC.FMGN520.qCgFUH2H21ieT641i9.log,2006120638,K006914E,MGN520,DM8KJ568SEC3
2006122226_006914J.E0J.914J-15.CZPC.FMGN520.nSHSp7klxjrQlVTcCu.log,2006122226,K006914E,MGN520,DM8KJ568SEC3
2006130919_006914J.E0J.914J-15.CZPC.FMGN520.Zd6DrMUsCjuEVBFwvn.log,2006130919,K006914E,MGN520,DM8KJ568SEC3
2006140457_007911J.E0J.911J-25.RDR2PC.FMGN520.QPX9r59TnXObXyfibv.log,2006140457,K007911E,MGN520,DN4AU351SED1
2006141722_007911J.E0J.911J-25.WCPC.FMGN520.dNQLkvQlPTplEjJspB.log,2006141722,K007911E,MGN520,DN4AU351SED1
2006160332_007911J.E0J.911J-25.WCPC.FMGN520.DQiH82Ze9fCoaLVbDE.log,2006160332,K007911E,MGN520,DN4AU351SED1
2006170539_007911J.E0J.911J-25.WCPC.FMGN520.TjakhXkmhmlGhfLheo.log,2006170539,K007911E,MGN520,DN4AU351SED1
edo101

1 Answer


Add the dtype parameter when you call read_csv so that 'ODFS_FILE_CREATE_DATETIME' is imported as a string:

odfscsv_df = pd.read_csv(csv, header=None,
                              names=['ODFS_LOG_FILENAME', 'ODFS_FILE_CREATE_DATETIME', 'LOT', 'TESTER', 'WAFER_SCRIBE'],
                              dtype={'ODFS_FILE_CREATE_DATETIME': str})

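m1 = odfscsv_df.isna().any(axis=1)    # row has NaN in any column
s = odfscsv_df['ODFS_FILE_CREATE_DATETIME']
m2 = ~s.astype(str).str.isnumeric()   # value contains a non-numeric character
m3 = s.astype(str).str.len().ne(10)   # value is not exactly 10 characters long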

[tuple(x) for x in odfscsv_df[m1 | m2 | m3].values]
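
To see the three masks in action without the original file, here is a minimal sketch on a few made-up rows in the same five-column layout. The file names and the `bad_rows` / `bad_names` variables are assumptions for illustration only; the masks are the ones from the snippet above.

```python
import io
import pandas as pd

# Made-up rows in the same layout as the question's CSV (not real data).
sample_csv = io.StringIO(
    "a.log,,W943SK10,MGN520,0Z0RK072TCD2\n"                 # missing datetime
    "b.log,2005242132,K001914E,MGN520,DM3FZ226SEE3\n"       # valid
    "c.log,2004dg,K001914E,MGN520,DM3FZ226SEE3\n"           # non-digit characters
    "d.log,200524213,K001914E,MGN520,DM3FZ226SEE3\n"        # only 9 digits
)
columns = ['ODFS_LOG_FILENAME', 'ODFS_FILE_CREATE_DATETIME', 'LOT', 'TESTER', 'WAFER_SCRIBE']
df = pd.read_csv(sample_csv, header=None, names=columns,
                 dtype={'ODFS_FILE_CREATE_DATETIME': str})

s = df['ODFS_FILE_CREATE_DATETIME'].astype(str)
m1 = df.isna().any(axis=1)   # any NaN in the row
m2 = ~s.str.isnumeric()      # contains a non-numeric character
m3 = s.str.len().ne(10)      # length is not 10

# Rows matching at least one condition, as a list of tuples.
bad_rows = [tuple(x) for x in df[m1 | m2 | m3].values]
bad_names = [row[0] for row in bad_rows]
print(bad_names)
```

Only the row with a clean 10-digit value is left out; the other three each trip at least one mask.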
Andy L.
  • Actually it's one condition. Everything is converted into a string for the regex search. If you look at the first comprehension I wrote for this issue: [(odfsname, "Bad_ODFS_FILE_CREATE_DATETIME= " + str(cdatetime), csv.name) for odfsname,cdatetime in zip(odfscsv_df['ODFS_LOG_FILENAME'], odfscsv_df['ODFS_FILE_CREATE_DATETIME']) if not odfscdate_re.search(str(cdatetime))] it only uses one condition: the cdatetime, which is converted to a string for the search. The regex then checks whether it has 10 numeric digits. I don't need to separate it into multiple conditions. The first column is 'ODFS_LOG_FILENAME' – edo101 Jun 30 '20 at 23:35
  • I guess I misunderstand. what is the 2nd condition to pick rows? – Andy L. Jun 30 '20 at 23:38
  • If the value in column 'ODFS_FILE_CREATE_DATETIME' is not a number containing 10 digits. I purposefully cast the value to a string when I feed it into the regex to avoid all the weird cases where Pandas turns my data into ints and floats. Casting the value to a string simplifies things – edo101 Jun 30 '20 at 23:40
  • Btw, your s = statement gives me the error: raise AttributeError("Can only use .str accessor with string values!") AttributeError: Can only use .str accessor with string values! – edo101 Jun 30 '20 at 23:45
  • Try my updated answer. Add the `dtype` parameter to `read_csv` and chain `astype` as shown in my updated answer. – Andy L. Jul 01 '20 at 00:09
  • I notice you didn't use regex. Would your code account for situations where the date time appears as 2200412d00 or 2020-14-12? Would it be faster to use regex? If so how would you modify it for regex? – edo101 Jul 01 '20 at 00:12
  • It will include both in the list of tuples. It is definitely faster than regex searching – Andy L. Jul 01 '20 at 00:16
  • So m2 = ~s.astype(str).str.isnumeric() checks whether it is not a number, and m3 = s.astype(str).str.len().ne(10) checks whether it is a number but its length is not equal to 10, and adds these separate instances? – edo101 Jul 01 '20 at 00:16
  • `m2` checks whether a string contains any non-digit character. `m3` checks whether a string has a length NOT equal to 10. If either is `True`, the row is added to the list of tuples. – Andy L. Jul 01 '20 at 00:19
  • You're brilliant! It works. My question now is, is there a reason you chose the str methods as opposed to regex? Is it faster? better in what? @Andy L – edo101 Jul 01 '20 at 00:23
  • `str` works on the whole series and takes advantage of vectorized operations, so it is usually faster than regex from the `re` module. I also prefer the readability of the `str` accessor. – Andy L. Jul 01 '20 at 00:29
  • I see. Okay what if I need to use regex for a more intricate situation. Like where you need to do a more complex pattern match that cannot be addressed by your str methods (which seem to be ideal for a simple situation like this), how would one incorporate regex? – edo101 Jul 01 '20 at 00:33
  • @edo101: in that case, you may need `str.extract`, `str.extractall`, or `str.findall`. More info: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.str.extract.html – Andy L. Jul 01 '20 at 00:35
  • I suggest you read docs on them to get familiar with them in case you prefer regex route. – Andy L. Jul 01 '20 at 00:37
  • According to this: https://stackoverflow.com/questions/54028199/are-for-loops-in-pandas-really-bad-when-should-i-care string regex extractions are not advised. I tried to modify your code to this but it is not working: m2 = bool(odfscdate_re.search(str(s))) [tuple(x) for x in odfscsv_df[m1 | ~bool(odfscdate_re.search(str(s)))].values] How can I modify m2 to use my re.compile pattern: odfscdate_re = re.compile(r"\d{10}") and actually return the right number of non-matching occurrences and only output those? The code as I modified it prints everything – edo101 Jul 01 '20 at 00:49
  • https://stackoverflow.com/questions/62668139/how-to-use-regex-re-compile-match-or-findall-in-list-comprehension Sorry to do this to you but here is the follow up question. For clarity. It seems I've dragged you into an inception of Stack overflow issues. – edo101 Jul 01 '20 at 00:58
  • I posted an answer to your new question. – Andy L. Jul 01 '20 at 01:46
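
For readers who land here with the same follow-up (how to build `m2` from the compiled pattern), the sketch below shows one possible way; it is not necessarily what the linked answer does. The attempt in the comments calls `str(s)` on the whole Series and wraps the result in `bool`, which yields a single Python value rather than one value per row, so the pattern has to be applied element by element instead. The sample values and the names `bad_re` / `bad_str` are made up for illustration. Note also that `search` with `\d{10}` accepts any string that merely contains 10 consecutive digits, so `fullmatch` is used here to require exactly 10.

```python
import re
import pandas as pd

odfscdate_re = re.compile(r"\d{10}")  # pattern from the question

# Made-up values: one valid, three invalid (the last one has 11 digits).
s = pd.Series(["2005242132", "2004dg", "2020-14-12", "20052421321"])

# search() would let the 11-digit value through, fullmatch() does not.
assert odfscdate_re.search("20052421321") is not None
assert odfscdate_re.fullmatch("20052421321") is None

# Option 1: apply the compiled pattern per element; True means "bad value".
bad_re = s.map(lambda v: odfscdate_re.fullmatch(str(v)) is None)

# Option 2: the same check through the .str accessor (Series.str.fullmatch).
bad_str = ~s.astype(str).str.fullmatch(odfscdate_re.pattern)

print(bad_re.tolist())
print(bad_str.tolist())
```

Either mask can be used in place of `m2 | m3` and combined with `m1`, which still catches the rows with missing values.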