
I have a log file that I tried to read into pandas with read_csv or read_table. Here is an example of the result:

0    date=2015-09-17    time=21:05:35     duration=0    etc...

all in one column.

I would like to split each row, take the names (like date, time, ...) and convert them to columns so I would get:

          date           time     duration   ...
  0    2015-09-17      21:05:35      0              

Thank you!

datascana
  • Is there a clear separator? E.g. 2 or more whitespaces (`sep=\s{2,}`)? Or is the problem that separators are mixed into the data? – jezrael Nov 23 '17 at 09:59
  • Yes, there is a separator, I think 1 whitespace, but I'm getting Expected 30 fields in line 267, saw 31. Error could possibly be due to quotes being ignored when a multi-char delimiter is used. – datascana Nov 23 '17 at 10:08
  • 1
    I was worried about it - data are messed or maybe another problem. So in my opinion without sample data file is really hard answering, because it is really data dependent problem. – jezrael Nov 23 '17 at 10:13
  • Isn't there a way to create a dict with the names of the columns and then split on it? – datascana Nov 23 '17 at 10:16
  • Hard question. Maybe yes, but in my opinion a pure Python solution is better - read the rows of the file and split them into lists. But obviously this task is a bit complicated; mainly it depends on the real data. – jezrael Nov 23 '17 at 10:18
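The pure-Python approach jezrael suggests above (read the rows and split them into lists) can be sketched like this; a minimal example assuming whitespace-separated key=value tokens, with a leading row index as in the question's sample:

```python
import pandas as pd

# Stand-in for lines read from the log file
lines = [
    "0    date=2015-09-17    time=21:05:35     duration=0",
    "1    date=2015-09-17    time=21:05:36     duration=5",
]

rows = []
for line in lines:
    tokens = line.split()  # split on any run of whitespace
    # keep only key=value tokens, skipping the bare row index
    pairs = [t.split("=", 1) for t in tokens if "=" in t]
    rows.append(dict(pairs))

df = pd.DataFrame(rows)
print(df)
```

This tolerates rows with missing or reordered keys: pandas aligns the dicts by key and fills gaps with NaN.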

2 Answers


I know this is an old post, but I came across the same problem and found a solution. The error Expected n fields in line n, saw n is probably due to rows having different numbers of columns. This method is also not suitable if the ordering of the columns differs between rows. Here is sample code that converts your log into JSON and then into a pandas DataFrame.

import pandas as pd
import json

path = 'log_sample.log'

result = {}
with open(path, 'r') as log_data:
    for i, line in enumerate(log_data):
        columns = line.split()  # or whatever your delimiter/separator is
        data = {}
        for c in columns:
            if '=' not in c:  # skip tokens that are not key=value pairs
                continue
            key, value = c.split('=', 1)
            data[key] = value
        result[i] = data

j = json.dumps(result)
df = pd.read_json(j, orient='index')
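As a side note, the JSON round-trip above can be skipped: pandas can build the frame directly from a dict of dicts. A sketch assuming the same `result` structure as in the loop above:

```python
import pandas as pd

# result maps row number -> {column name: value}, as built by the loop above
result = {
    0: {"date": "2015-09-17", "time": "21:05:35", "duration": "0"},
    1: {"date": "2015-09-17", "time": "21:05:36", "duration": "0"},
}

df = pd.DataFrame.from_dict(result, orient="index")
print(df)
```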
Ricky Kim

----- Editing answer to account for inconsistent spacing:

Not sure what the pythonic approach should be, but here's a method that could work.

Using OP's data sample as an example:

0    date=2015-09-17    time=21:05:35     duration=0
1    date=2015-09-17    time=21:05:36     duration=0
2    date=2015-09-17    time=21:05:37     duration=0
3    date=2015-09-17    time=21:05:38     duration=0
4    date=2015-09-17    time=21:05:39     duration=0
5    date=2015-09-17    time=21:05:40     duration=0

I loop through each line and split at the equals sign, then grab the desired text:

import pandas as pd

split_list = []
with open('log_sample.txt', 'r') as log_data:
    for line in log_data:
        parts = line.split('=')
        # parts[1] starts with the date, parts[2] with the time,
        # parts[3] is the duration (plus a trailing newline)
        date = parts[1][:10]
        time = parts[2][:8]
        dur = parts[3].strip()
        split_list.append([date, time, dur])

df = pd.DataFrame(split_list, columns=['date', 'time', 'duration'])
df
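An alternative to slicing fixed-width substrings is to let pandas parse the key=value pairs with a regular expression via `Series.str.extract`; a sketch, with the group names assumed to match the sample data:

```python
import pandas as pd

lines = [
    "0    date=2015-09-17    time=21:05:35     duration=0",
    "1    date=2015-09-17    time=21:05:36     duration=0",
]

s = pd.Series(lines)
# Named capture groups become the DataFrame's column names
df = s.str.extract(
    r"date=(?P<date>\S+)\s+time=(?P<time>\S+)\s+duration=(?P<duration>\S+)"
)
print(df)
```

This is more robust than positional slicing when field widths vary, though it still assumes the keys appear in a fixed order.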

----- First Answer:

As @jezrael mentions in the comments, you can leverage the "sep" argument within read_csv.

pd.read_csv('test.txt', sep=r'\\t', engine='python')


  • I've tried that option, it's raising an error Expected 30 fields in line 267, saw 31. Error could possibly be due to quotes being ignored when a multi-char delimiter is used. – datascana Nov 23 '17 at 10:20
  • Could you loop through each line while splitting at the equal sign and assigning variables to a dictionary or list as you need to? – thedatadavis Nov 23 '17 at 10:25
  • 2
    `sep=r'\t'` worked for me instead of the double escape. Which makes sense as r'' acts as the escape for the backslash. – adjpayot Mar 07 '19 at 12:35
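Tying the separator discussion together: for logs where the only separator is runs of whitespace, the regex separator jezrael mentioned in the comments can be passed straight to read_csv. A sketch using an in-memory stand-in for the file (header handling will depend on the actual log):

```python
import io
import pandas as pd

# In-memory stand-in for the log file
data = (
    "0    date=2015-09-17    time=21:05:35    duration=0\n"
    "1    date=2015-09-17    time=21:05:36    duration=0\n"
)

# A regex separator requires the python engine
df = pd.read_csv(io.StringIO(data), sep=r"\s+", engine="python", header=None)
print(df)
```

Note this still leaves the `key=value` text in each cell; the splitting shown in the answers above is needed to turn the keys into column names.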