I am trying to read a large set of log files that use different delimiters from row to row (a legacy issue).
Code

```python
import os

import pandas as pd

for root, dirs, files in os.walk('.', topdown=True):
    for file in files:
        # Read each line as a single column, then split on the first run of
        # delimiter characters (comma, pipe, semicolon, colon, space, tab).
        df = pd.read_csv(os.path.join(root, file), sep='\n', header=None,
                         skipinitialspace=True)
        df = (df[0].str.split('[,|;: \t]+', n=1, expand=True)
                   .rename(columns={0: 'email', 1: 'data'}))
        df.email = df.email.str.lower()
        print(df)
```
input-file

```
user1@email.com address1
User2@email.com address2
user3@email.com,address3
user4@email.com;;addre'ss4
UseR5@email.com,,address"5
user6@email.com,,address;6
single.col1;
single.col2 [spaces at the beginning of the row]
single.col3 [tabs at the beginning of the row]
nonascii.row;data.is.junk-Œœ
not.email;address11
not_email;address22
```
Issues
- Rows containing any non-ASCII characters need to be removed from the DataFrame (the entire row must be excluded and purged).
- Rows with leading tabs or spaces need to be trimmed. I have 'skipinitialspace=True', but it seems this does not remove tabs.
- 'df.email' needs to be checked against a valid email regex; if it is not a valid address, the entire row must be purged.
Would appreciate any help.
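To make the requirements concrete, here is a minimal sketch of the three clean-up steps against an in-memory sample (hypothetical data mirroring the input file above; the email pattern is deliberately simple, not a full RFC 5322 check):

```python
import pandas as pd

# Hypothetical in-memory sample mirroring the input file above.
lines = [
    'user1@email.com address1',
    'User2@email.com,address2',
    '   \tuser3@email.com;address3',    # leading spaces/tabs
    'nonascii.row;data.is.junk-Œœ',     # non-ASCII -> drop whole row
    'not_email;address22',              # not a valid email -> drop whole row
]
df = pd.DataFrame({0: lines})

# 1. Trim leading spaces *and* tabs (skipinitialspace handles only spaces).
df[0] = df[0].str.lstrip()

# 2. Purge rows that contain any non-ASCII character.
df = df[df[0].map(str.isascii)]

# 3. Split on the first run of delimiters, then keep only valid-looking emails.
df = (df[0].str.split(r'[,|;: \t]+', n=1, expand=True)
           .rename(columns={0: 'email', 1: 'data'}))
df = df[df['email'].str.match(r'^[\w.+-]+@[\w-]+\.[\w.-]+$')]
df['email'] = df['email'].str.lower()
print(df)
```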