3

I am trying to read a large log file, which has been parsed using different delimiters (legacy changes).

This code works

import os, subprocess, time, re
import pandas as pd

for root, dirs, files in os.walk('.', topdown=True):
    for file in files:
        df = pd.read_csv(file, sep='[,|;: \t]+', header=None, engine='python', skipinitialspace=True)

        for index, row in df.iterrows():
            print(row[0], row[1])

This works well for the following data

user1@email.com address1
user2@email.com;address2
user3@email.com,address3
user4@email.com;;address4
user5@email.com,,address5

Issue #1: the following row in the input file will break the code. I wish for this to be parsed into 2 columns (not 3)

user6@email.com,,address;6

Issue #2: I wish to replace all single and double quotes in address, but neither of the following seem to work.

df[1]=df[1].str.replace('"','DQUOTES')
df.replace('"', 'DQUOTES', regex=True)

Pls help!

rogerwhite
  • 335
  • 4
  • 16
  • Try doing pre-processing of normalizing the delimiter before reading the file into pandas dataframe. – sushanth Jun 11 '20 at 05:26

1 Answers1

0

You can first read the file into one column and then do the processing step by step in pandas:

  • split into two columns (n=1)
  • replace the quotes
  • if needed (i.e. if there are possibly further columns you don't need) split the address column again and take the first column ([0]) only (here you may want to remove the space from the list of separators). If any commas and semicolons etc. are part of the address then you of course don't need this step.


import io
s= """user1@email.com address1
user2@email.com;address2
user3@email.com,address3
user4@email.com;;address4
user5@email.com,,address5
user6@email.com,,address;6
user6@email.com,,address with "double quotes"
user6@email.com,,address with 'single quotes'
"""
df = pd.read_csv(io.StringIO(s), sep='\n', header=None)

df = df[0].str.split('[,|;: \t]+', 1, expand=True).rename(columns={0: 'email', 1: 'address'})
df.address = df.address.str.replace('\'|"', 'DQUOTES')
df.address = df.address.str.split('[,|;:]+', 1, expand=True)[0]  #depending on what you need

Result:

             email                                   address
0  user1@email.com                                  address1
1  user2@email.com                                  address2
2  user3@email.com                                  address3
3  user4@email.com                                  address4
4  user5@email.com                                  address5
5  user6@email.com                                   address
6  user6@email.com  address with DQUOTESdouble quotesDQUOTES
7  user6@email.com  address with DQUOTESsingle quotesDQUOTES
Stef
  • 28,728
  • 2
  • 24
  • 52
  • Thank you @Stef. I have posted a follow-up question. if you could help with that too pls: https://stackoverflow.com/questions/62336455/python-pandas-data-cleaning – rogerwhite Jun 12 '20 at 02:17