I am trying to read a large log file that was written using different delimiters (a legacy issue).

Code

import os
import pandas as pd

for root, dirs, files in os.walk('.', topdown=True):
    for file in files:
        df = pd.read_csv(os.path.join(root, file), sep='\n', header=None, skipinitialspace=True)
        df = df[0].str.split(r'[,|;: \t]+', n=1, expand=True).rename(columns={0: 'email', 1: 'data'})
        df.email = df.email.str.lower()
        print(df)

input-file

user1@email.com         address1
User2@email.com    address2
 user3@email.com,address3
  user4@email.com;;addre'ss4
UseR5@email.com,,address"5
user6@email.com,,address;6
single.col1;
 single.col2                 [spaces at the beginning of the row]
    single.col3              [tabs at the beginning of the row]
nonascii.row;data.is.junk-Œœ
not.email;address11
not_email;address22

Issues

  • Rows that contain any non-ASCII characters need to be removed from the DF (i.e. the full row needs to be excluded and purged)
  • Rows with tabs or spaces at the beginning need to be trimmed. I have `skipinitialspace=True`, but it seems this does not remove the tabs
  • I need to check `df.email` against a valid email regex format; if it does not match, the full row needs to be purged

Would appreciate any help

rogerwhite

1 Answer

df = pd.read_csv(file, sep='\n', header=None)

#remove leading/trailing whitespace and split into columns
df = df[0].str.strip().str.split(r'[,|;: \t]+', n=1, expand=True).rename(columns={0: 'email', 1: 'data'})

#drop rows with non-ASCII characters (code points < 32 or > 255; adapt the second bound to your needs)
df = df[~df.data.fillna('').str.contains('[^ -ÿ]')]

#drop rows with invalid email addresses
email_re = r"^\w+(?:[-+.']\w+)*@\w+(?:[-.]\w+)*\.\w+(?:[-.]\w+)*$"
df = df[df.email.fillna('').str.contains(email_re)]

The email regex was taken from here (I just changed the capturing groups to non-capturing). If you want to be more comprehensive you can use this monster-regex as well.
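Putting the pieces together, here is a minimal self-contained sketch of the whole pipeline run against the sample rows from the question. Note two assumptions: newer pandas versions reject `sep='\n'`, so this sketch builds the one-column frame from `splitlines()` instead of `read_csv`, and it uses the keyword arguments `n=1, regex=True` required by current `str.split`.

```python
import pandas as pd

# Sample rows mirroring the input file from the question
raw = """user1@email.com         address1
User2@email.com    address2
 user3@email.com,address3
  user4@email.com;;addre'ss4
UseR5@email.com,,address"5
user6@email.com,,address;6
single.col1;
nonascii.row;data.is.junk-Œœ
not.email;address11"""

# Newer pandas rejects sep='\n', so build the one-column frame directly
df = pd.DataFrame({0: raw.splitlines()})

# Strip leading/trailing whitespace (spaces and tabs), then split on the
# first run of delimiters into 'email' and 'data' columns
df = (df[0].str.strip()
           .str.split(r'[,|;: \t]+', n=1, expand=True, regex=True)
           .rename(columns={0: 'email', 1: 'data'}))

# Drop rows whose data contains characters outside the space..ÿ range
df = df[~df.data.fillna('').str.contains('[^ -ÿ]')]

# Drop rows whose email column does not look like an email address
email_re = r"^\w+(?:[-+.']\w+)*@\w+(?:[-.]\w+)*\.\w+(?:[-.]\w+)*$"
df = df[df.email.fillna('').str.contains(email_re)]

df.email = df.email.str.lower()
print(df)
```

On this sample, only the six `user*@email.com` rows survive: the non-ASCII row, the single-column rows, and the non-email rows are all purged.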

Stef
  • Thanks Stef, there is a bug in this code: it drops some valid rows as well – rogerwhite Jun 12 '20 at 06:46
  • which ones are dropped but shouldn't? – Stef Jun 12 '20 at 06:47
  • my bad, this works. thanks! Could you pls explain this regex: `'[^ -ÿ]'` ? secondly, for some of the log-files, I get the error: `UnicodeDecodeError: 'utf-8' codec can't decode byte 0xc3 in position 37: invalid continuation byte`.. do pls help – rogerwhite Jun 12 '20 at 08:05
  • `'[^ -ÿ]'` means any character **not** between space (`' '`) and `ÿ`, which is the last character of the Latin-1 range (`chr(255)`) – Stef Jun 12 '20 at 08:08
  • as for the error: this is obviously a decoding issue. You may try a different encoding by specifying `encoding='latin-1'` in read_csv. [Here](https://docs.python.org/3/library/codecs.html#standard-encodings) is a list of all possible standard encodings. – Stef Jun 12 '20 at 08:15
  • Thanks Stef: another follow up question pls: https://stackoverflow.com/questions/62342292/python-pandas-data-cleaning-issue – rogerwhite Jun 12 '20 at 10:24