I am trying to read a large log file that was written using different delimiters (a legacy issue).

Code

import os
import pandas as pd

for root, dirs, files in os.walk('.', topdown=True):
    for file in files:
        df = pd.read_csv(os.path.join(root, file), sep='\n', header=None, skipinitialspace=True)
        df = df[0].str.split(r'[,|;: \t]+', n=1, expand=True).rename(columns={0: 'email', 1: 'data'})
        df.email = df.email.str.lower()
        print(df)

input-file

user1@email.com         address1
User2@email.com    address2
 user3@email.com,address3
  user4@email.com;;addre'ss4
UseR5@email.com,,address"5
user6@email.com,,address;6
single.col1;
 single.col2                 [spaces at the beginning of the row]
    single.col3              [tabs at the beginning of the row]
nonascii.row;data.is.junk-Œœ
not.email;address11
not_email;address22

Issues

  • Rows that contain any non-ASCII characters need to be removed from the DF (i.e. the full row needs to be excluded and purged)
  • Rows with tabs or spaces at the beginning need to be trimmed. I have `skipinitialspace=True`, but it seems this does not remove the tabs
  • I need to check `df.email` against a valid email regex format; if it does not match, the full row needs to be purged

Would appreciate any help

rogerwhite

1 Answer

df = pd.read_csv(file, sep='\n', header=None)

#remove leading/trailing whitespace and split into columns
df = df[0].str.strip().str.split(r'[,|;: \t]+', n=1, expand=True).rename(columns={0: 'email', 1: 'data'})

#drop rows with non-ASCII characters (code points < 32 or > 255; adapt the second bound to your needs)
df = df[~df.data.fillna('').str.contains('[^ -ÿ]')]

#drop rows with invalid email addresses
email_re = r"^\w+(?:[-+.']\w+)*@\w+(?:[-.]\w+)*\.\w+(?:[-.]\w+)*$"
df = df[df.email.fillna('').str.contains(email_re)]

The email regex was taken from here (I just changed the capturing groups to non-capturing). If you want to be more comprehensive you can use this monster-regex as well.
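Putting the pieces together, here is a minimal self-contained sketch of the whole pipeline run against the sample rows from the question. Note two assumptions: newer pandas versions reject `sep='\n'`, so this sketch builds the one-column frame from `splitlines()` instead of `read_csv`, and it uses the keyword arguments `n=1, regex=True` required by current `str.split`.

```python
import pandas as pd

# Sample rows mirroring the input file from the question
raw = """user1@email.com         address1
User2@email.com    address2
 user3@email.com,address3
  user4@email.com;;addre'ss4
UseR5@email.com,,address"5
user6@email.com,,address;6
single.col1;
nonascii.row;data.is.junk-Œœ
not.email;address11"""

# Newer pandas rejects sep='\n', so build the one-column frame directly
df = pd.DataFrame({0: raw.splitlines()})

# Strip leading/trailing whitespace (spaces and tabs), then split on the
# first run of delimiters into 'email' and 'data' columns
df = (df[0].str.strip()
           .str.split(r'[,|;: \t]+', n=1, expand=True, regex=True)
           .rename(columns={0: 'email', 1: 'data'}))

# Drop rows whose data contains characters outside the space..ÿ range
df = df[~df.data.fillna('').str.contains('[^ -ÿ]')]

# Drop rows whose email column does not look like an email address
email_re = r"^\w+(?:[-+.']\w+)*@\w+(?:[-.]\w+)*\.\w+(?:[-.]\w+)*$"
df = df[df.email.fillna('').str.contains(email_re)]

df.email = df.email.str.lower()
print(df)
```

On this sample, only the six `user*@email.com` rows survive: the non-ASCII row, the single-column rows, and the non-email rows are all purged.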

Stef
  • Thanks Stef, there is a bug in this code: it drops some valid rows as well – rogerwhite Jun 12 '20 at 06:46
  • which ones are dropped but shouldn't? – Stef Jun 12 '20 at 06:47
  • my bad, this works. thanks! Could you pls explain this regex: `'[^ -ÿ]'` ? secondly, for some of the log-files, I get the error: `UnicodeDecodeError: 'utf-8' codec can't decode byte 0xc3 in position 37: invalid continuation byte`.. do pls help – rogerwhite Jun 12 '20 at 08:05
  • `'[^ -ÿ]'` means any character **not** between space (`' '`) and `ÿ`, which is the last character of the Latin-1 range (`chr(255)`) – Stef Jun 12 '20 at 08:08
  • as for the error: this is obviously a decoding issue. You may try a different encoding by specifying `encoding='latin-1'` in read_csv. [Here](https://docs.python.org/3/library/codecs.html#standard-encodings) is a list of all possible standard encodings. – Stef Jun 12 '20 at 08:15
  • Thanks Stef: another follow up question pls: https://stackoverflow.com/questions/62342292/python-pandas-data-cleaning-issue – rogerwhite Jun 12 '20 at 10:24