1

I am quite new to pandas. I have a log text file. I am trying to grab few data point from the file. Below is the code that kind of gets me the desired data but not in desired format. I wanted Pandas data frame with two columns.

import os
from collections import Counter
import pandas as pd
#print(os.getcwd())
infile = "myfile.txt"

important = []
keep_phrases = ["Host",
              "User-Agent"
              ]

with open(infile) as f:
    f = f.readlines()

for line in f:
    for phrase in keep_phrases:
        if phrase in line:
            important.append(line)

            break
#print(type(important))
print(important)
#Counter(important)
pd.DataFrame(important)

This does not give me output in two column. I am looking for host and user agent as one row.

Sample of the text file as below

   15 SessionOpen  c aa.bb.cc.ddd 62667 :8080
   15 SessionClose c pipe
   15 ReqStart     c aa.bb.cc.ddd 62667 442374415
   15 RxURL        c /61665002001003_001/CH4_08_02_24_61665002001003_001_16x9_1500000_Seg1-Frag666
   15 RxHeader     c Host: ll.abrstream.channel4.com
   15 RxHeader     c Connection: keep-alive
   15 RxHeader     c User-Agent: Mozilla/5.0 (Windows NT 6.3; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/49.0.2623.112 Safari/537.36
   15 RxHeader     c X-Requested-With: ShockwaveFlash/21.0.0.216
   15 RxHeader     c Accept: */*
   15 RxHeader     c Referer: http://www.channel4.com/programmes/the-tiny-tots-talent-agency/on-demand/61665-002
   15 RxHeader     c Accept-Encoding: gzip, deflate, sdch
   15 RxHeader     c Accept-Language: en-US,en;q=0.8
   15 ReqEnd       c 442374415 1461870946.496117592 1461870947.112555504 0.000315428 0.001363039 0.615074873
   15 SessionOpen  c aa1.bb1.cc1.ddd1 59409 :8080
   15 SessionClose c pipe
   15 ReqStart     c aa1.bb1.cc1.ddd1 59409 442374416
   15 RxURL        c /gpsApi.php
   15 RxHeader     c Content-Length: 0
   15 RxHeader     c Host: map.yanue.net
   15 RxHeader     c Connection: Keep-Alive
   15 RxHeader     c User-Agent: Apache-HttpClient/UNAVAILABLE (java 1.4)
   15 ReqEnd       c 442374416 1461870950.580444574 1461870951.139206648 0.000064135 0.001196861 0.557565212
   15 SessionOpen  c aa1.bb1.cc1.ddd1 52179 :8080
   15 SessionClose c pipe
   15 ReqStart     c aa1.bb1.cc1.ddd1 52179 442374417
   15 RxURL        c /gpsApi.php
   15 RxHeader     c Content-Length: 0
   15 RxHeader     c Host: map.yanue.net
   15 RxHeader     c Connection: Keep-Alive
   15 RxHeader     c User-Agent: Apache-HttpClient/UNAVAILABLE (java 1.4)
   15 ReqEnd       c 442374417 1461870951.776547432 1461870952.448071241 0.000062943 0.001109123 0.670414686
   18 SessionOpen  c aa.bb.cc.ddd 62670 :8080
   18 SessionClose c pipe
   18 ReqStart     c aa.bb.cc.ddd 62670 442374418
   18 RxURL        c /61665002001003_001/CH4_08_02_24_61665002001003_001_16x9_1500000_Seg1-Frag667
   18 RxHeader     c Host: ll.abrstream.channel4.com
   18 RxHeader     c Connection: keep-alive
   18 RxHeader     c User-Agent: Mozilla/5.0 (Windows NT 6.3; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/49.0.2623.112 Safari/537.36
   18 RxHeader     c X-Requested-With: ShockwaveFlash/21.0.0.216
   18 RxHeader     c Accept: */*
   18 RxHeader     c Referer: http://www.channel4.com/programmes/the-tiny-tots-talent-agency/on-demand/61665-002
   18 RxHeader     c Accept-Encoding: gzip, deflate, sdch
   18 RxHeader     c Accept-Language: en-US,en;q=0.8
   18 ReqEnd       c 442374418 1461870951.920178175 1461870952.507097483 0.001731873 0.001337051 0.585582256
   15 SessionOpen  c aa1.bb1.cc1.ddd1 48034 :8080
   15 SessionClose c pipe
Jezzamon
  • 1,453
  • 1
  • 15
  • 27
manisha
  • 455
  • 2
  • 7
  • 10
  • Hey @manisha, the code and textfile section in your post is not properly formatted. Please at least break the lines. Also what do you want the index be, say session ID? – zaxliu May 25 '16 at 22:50
  • @zaxliu, It should look better now – manisha May 25 '16 at 22:53
  • At this time I was just looking for those two columns, – manisha May 25 '16 at 22:56
  • please `.join('\n')` the `important` list before printing it out so that you have line-breaks. – zaxliu May 25 '16 at 22:57
  • 1st just do a myfile=pd.read_csv ("myfile.txt") (no need for all the io boilerplate you have... see the doc then do your editing/selections in the Pandas dataframe http://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_csv.html – dartdog May 25 '16 at 23:08
  • @zaxliu, adding print "\n".join(important) shows the data in different line, but what I was looking to have [host, user-agent] as daraframe – manisha May 25 '16 at 23:09
  • @manisha, yeah I know. I was just trying to understand what the original txt file looks like. Or maybe you could just post some sample lines of the original file. – zaxliu May 26 '16 at 00:03

1 Answers1

0

You can create a dataframe by creating a list of lists, and then use the dataframe constructor.

Loop through each line of the file, like you've started doing, then split each line into the different columns. You can use re.split to create a list of the columns, limitting the maximum number of splits to treat the last column as one element. Alternatively, if you know each element is always going to be aligned in the same way, you can use slicing to create that list.

import re

df_list = []
with open(infile) as f:
    for line in f:
        # remove whitespace at the start and the newline at the end
        line = line.strip()
        # split each column on whitespace
        columns = re.split('\s+', line, maxsplit=4)
        df_list.append(columns)

You can then use the method in this answer to create the dataframe.

df = pd.DataFrame(df_list)
Community
  • 1
  • 1
Jezzamon
  • 1,453
  • 1
  • 15
  • 27