
The HTTP log files I'm trying to analyze with pandas sometimes contain unexpected lines. Here's how I load my data:

import pandas as pd

df = pd.read_csv('mylog.log',
                 sep=r'\s(?=(?:[^"]*"[^"]*")*[^"]*$)(?![^\[]*\])',
                 engine='python', na_values=['-'], header=None,
                 usecols=[0, 3, 4, 5, 6, 7, 8, 10],
                 names=['ip', 'time', 'request', 'status', 'size',
                        'referer', 'user_agent', 'req_time'],
                 converters={'status': int, 'size': int, 'req_time': int})

It works fine for most of the logs I have (which all come from the same server). However, loading some logs raises an exception: either

TypeError: int() argument must be a string, a bytes-like object or a number, not 'NoneType'

or

ValueError: invalid literal for int() with base 10: '"GET /agent/10577/bdl HTTP/1.1"'

For the sake of the example, here's the line that triggers the second exception:

22.111.117.229, 22.111.117.229 - - [19/Sep/2018:22:17:40 +0200] "GET /agent/10577/bdl HTTP/1.1" 204 - "-" "okhttp/3.8.0" apibackend.site.fr 429282
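
Splitting that line with the same separator shows what goes wrong (a quick sanity check with re.split, which is essentially what the python engine does with a regex sep): the duplicated client IP at the start adds an extra field, so every later column shifts one place to the right and 'status' receives the request string.

import re

sep = r'\s(?=(?:[^"]*"[^"]*")*[^"]*$)(?![^\[]*\])'
line = ('22.111.117.229, 22.111.117.229 - - [19/Sep/2018:22:17:40 +0200] '
        '"GET /agent/10577/bdl HTTP/1.1" 204 - "-" "okhttp/3.8.0" '
        'apibackend.site.fr 429282')

fields = re.split(sep, line)
print(len(fields))  # 12; a well-formed line from this server presumably has 11
print(fields[5])    # '"GET /agent/10577/bdl HTTP/1.1"' -- read as 'status' via usecols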

To find the number of the offending line, I used the following (terribly slow) bisection function:

def search_error_dichotomy(path):
    # Count the lines, then bisect on nrows until the failing line is isolated.
    with open(path) as log:
        borne_sup = len(log.readlines())
    borne_inf = 0
    while borne_sup - borne_inf > 1:
        exceeded = False
        search_index = (borne_inf + borne_sup) // 2
        try:
            pd.read_csv(path, ..., ..., nrows=search_index)
        except Exception:
            exceeded = True
        if exceeded:
            borne_sup = search_index
        else:
            borne_inf = search_index

    return search_index

What I'd like to have is something like this:

try:
    pd.read_csv(..........................)
except MyError as e:
    print(e.row_number)

where e.row_number is the number of the messy line.

Thank you in advance.

SOLUTION All credit to devssh, whose suggestion not only makes the process quicker but also lets me catch all the unexpected lines at once. Here's what I did with it (a consolidated sketch follows the steps):

  1. Load the dataframe without converters.

    df = pd.read_csv(path,
                     sep=r'\s(?=(?:[^"]*"[^"]*")*[^"]*$)(?![^\[]*\])',
                     engine='python', na_values=['-'], header=None,
                     usecols=[0, 3, 4, 5, 6, 7, 8, 10],
                     names=['ip', 'time', 'request', 'status', 'size',
                            'referer', 'user_agent', 'req_time'])
    
  2. Add an 'index' column using .reset_index().

    df = df.reset_index()
    
  3. Write a custom function (to be used with apply) that converts to int when possible, and otherwise records the entry and its 'index' in a dictionary wrong_lines.

    import numpy as np

    wrong_lines = {}

    def convert_int_feedback_index(row, col):
        try:
            ans = int(row[col])
        except (TypeError, ValueError):
            # Record the bad value under its original row number.
            wrong_lines[row['index']] = row[col]
            ans = np.nan
        return ans
    
  4. Use apply on the columns I want to convert (e.g. col = 'status', 'size', or 'req_time').

    df[col] = df.apply(convert_int_feedback_index, axis=1, col=col)
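
Putting the four steps together (a minimal sketch; the column list and the final print are just for illustration):

    for col in ['status', 'size', 'req_time']:
        df[col] = df.apply(convert_int_feedback_index, axis=1, col=col)

    # wrong_lines now maps original row numbers to the raw values
    # that could not be converted.
    print(wrong_lines)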
    
Olivier Roche
  • what is the exception? – anky Oct 07 '18 at 10:36
  • @devssh : Yes, I tried with small nrows; the error occurs at line 271119. As for removing the converters and using .astype or .apply, that sounds like a good idea, yet I don't know how to retrieve the number of the row where the exception occurs. – Olivier Roche Oct 07 '18 at 16:02
  • Did you see my suggestion of `reset_index`? It will give you `index` column with row numbers. Combine that with `apply` over multiple columns and you can customize everything. You seem to be getting the row numbers now by default with this alternative approach right? – devssh Oct 07 '18 at 16:04
  • 1
    @devssh : Wow, using 'df.reset_index()' will indeed do the trick! Thank you very much! – Olivier Roche Oct 07 '18 at 16:19

1 Answer


Did you try pd.read_csv(..., nrows=10) to see if it works on even 10 lines?

Perhaps you should not use converters to specify the dtypes.
Load the DataFrame, then apply the dtype to columns with df["column"] = df["column"].astype(np.int64), or use a custom function like df["column"] = df["column"].apply(lambda x: convert_type(x)) and handle the errors yourself in convert_type.
Finally, update the csv by calling df.to_csv("preprocessed.csv", header=True, index=False).
I don't think you can get the line number from pd.read_csv itself, and that separator looks too complex.
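
A minimal sketch of that idea (convert_type is a name I made up above, and the NaN fallback is just one possible choice):

    import numpy as np

    def convert_type(x):
        # Return an int when possible, NaN otherwise, so bad rows can be
        # located afterwards with df[df["column"].isna()].
        try:
            return int(x)
        except (TypeError, ValueError):
            return np.nan

    df["status"] = df["status"].apply(lambda x: convert_type(x))
    bad_rows = df[df["status"].isna()]   # their index labels point at the messy lines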

Or you can try reading the csv as a single-column DataFrame and use df["column"].str.extract with a regex to pull out the columns. That way you control how the exception is raised, or the default value used to handle the error.
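
Something along these lines (the pattern is a made-up illustration based on the sample line, not a complete parser for this log format):

    import pandas as pd

    # One raw line per row; splitting manually avoids fighting read_csv's sep.
    with open('mylog.log') as f:
        raw = pd.DataFrame({'line': f.read().splitlines()})

    # Named groups become columns; rows the pattern fails on come back as NaN.
    parts = raw['line'].str.extract(
        r'^(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] "(?P<request>[^"]*)" (?P<status>\d+)')
    print(raw[parts['status'].isna()].index.tolist())  # row numbers of non-matching lines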

df.reset_index() will give you the row numbers as an index column. Combine that with apply over multiple columns and you can customize everything.
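
For instance, on a toy frame:

    import pandas as pd

    df = pd.DataFrame({'status': ['204', '"GET /agent/10577/bdl HTTP/1.1"']})
    df = df.reset_index()            # adds an 'index' column holding 0, 1, ...
    print(df.columns.tolist())       # ['index', 'status']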

devssh