
I tried looking, but clearly I am missing a trick here. I have tried a couple of ideas for splitting a string separated by ; in a DataFrame in Python.

Can anybody tell me what I am doing wrong? I have only just picked up Python and would appreciate help. What I want is to split the string in recipient-address and duplicate the rest of the row for each resulting address. I have a LOT of log files to get through, so it needs to be efficient. I am using Anaconda Python 2.7 on Windows 7 64-bit. Thanks. The data in the input looks roughly like this:

#Fields: date-time,sender-address,recipient-address
2015-06-22T00:00:01.051Z, persona@gmail.com, other@gmail.com;mickey@gmail.com
2015-06-22T00:00:01.254Z, personb@gmail.com, mickey@gmail.com

What I am aiming at is:

  #Fields: date-time,sender-address,recipient-address
  2015-06-22T00:00:01.051Z, persona@gmail.com, other@gmail.com
  2015-06-22T00:00:01.051Z, persona@gmail.com, mickey@gmail.com
  2015-06-22T00:00:01.254Z, personb@gmail.com, mickey@gmail.com

Based on an approach I found, I have tried this:

for LOGfile in LOGfiles[:1]:
    readin = pandas.read_csv(LOGfile, skiprows=[0,1,2,3], parse_dates=['#Fields: date-time'], date_parser=dateparse)
    #s = df['recipient-address'].str.split(';').apply(Series, 1).stack()
    df = pandas.concat([Series(row['#Fields: date-time'], row['sender-address'], row['recipient-address'].split(';'))
                        for _, row in readin.iterrows()]).reset_index()

I keep getting the error:

NameError                                 Traceback (most recent call last)
 in <module>()
      4 readin = pandas.read_csv(LOGfile, skiprows=[0,1,2,3], parse_dates=['#Fields: date-time'], date_parser=dateparse)
      5 df=pandas.concat([Series(row['#Fields: date-time'], row['sender-address'],row['recipient-address'].split(';'))
----> 6                 for _, row in readin.iterrows()]).reset_index()

NameError: name 'Series' is not defined

Kate

2 Answers


I updated this answer with more complete/correct code - it now generates one row in the output DataFrame df for each recipient-address in the input logfile.

This might not be the most efficient solution but at least it works :-)

Err, you would get a quicker and easier answer if, with your question, you a) give a complete, executable short example of the code you have tried which reproduces your error, b) include the sample data needed to reproduce the error, and c) include the exact output/error messages from the code you show with the data you show. It's probably also a good idea to include version numbers and the platform you are running on. I'm working with 32-bit Python 2.7.8 on Windows 7 64-bit.

I created myself some sample data in a file log.txt:

date-time,sender-address,recipient-address
1-1-2015,me@my.com,me1@my.com;me2@my.com
2-2-2015,me3@my.com,me4@my.com;me5@my.com

I then created a complete working example Python file (also making some minimal simplifications to your code snippet) and fixed it. My code, which works with my data, is:

import pandas

LOGfiles = ('log.txt','log.txt')

for LOGfile in LOGfiles[:1]:
    readin = pandas.read_csv(LOGfile, parse_dates=['date-time'])
    #s = df['recipient-address'].str.split(';').apply(Series, 1).stack()
    rows = []
    for _, row in readin.iterrows():
        for recip in row['recipient-address'].split(';'):
            rows.append(pandas.Series(data={'date-time':row['date-time'], 'sender-address':row['sender-address'],'recipient-address':recip}))
    df = pandas.concat(rows)
    print df

The output from this code is:

date-time            2015-01-01 00:00:00
recipient-address             me1@my.com
sender-address                 me@my.com
date-time            2015-01-01 00:00:00
recipient-address             me2@my.com
sender-address                 me@my.com
date-time            2015-02-02 00:00:00
recipient-address             me4@my.com
sender-address                me3@my.com
date-time            2015-02-02 00:00:00
recipient-address             me5@my.com
sender-address                me3@my.com
dtype: object

The main thing I did to find out what was wrong with your code was to break the problem down, because your code may be short but it includes several potential sources of problems besides the split. First I made sure the iteration over the rows works and that the split(';') behaves as expected (it does); then I started constructing a Series and found I needed the pandas. prefix on Series, and the data={...} passed as a dictionary.
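As an aside (not part of the original answer): on newer pandas versions (0.25 and later), the idea in the commented-out line - splitting the column and stacking the pieces onto their own rows - can be done directly with DataFrame.explode. A sketch on the same sample data:

```python
import io

import pandas

# Sample data matching the log.txt used above (an assumption for this sketch).
csv_text = """date-time,sender-address,recipient-address
1-1-2015,me@my.com,me1@my.com;me2@my.com
2-2-2015,me3@my.com,me4@my.com;me5@my.com
"""

df = pandas.read_csv(io.StringIO(csv_text))
# Split the semicolon-separated recipients into lists, then expand each
# list element onto its own row, duplicating the other columns.
df['recipient-address'] = df['recipient-address'].str.split(';')
df = df.explode('recipient-address').reset_index(drop=True)
print(df)
```

This keeps the result as a proper DataFrame (one row per recipient) without any explicit Python-level loop, which should also be faster over a lot of log files.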

HTH barny

  • Hi, I amended the original question - is there any way to get the data into a DataFrame, or to pivot the series? – Kate Aug 13 '15 at 09:59
  • df is a dataFrame, isn't it? Can't you write out df to the results file at the end of each logfile? – DisappointedByUnaccountableMod Aug 13 '15 at 10:31
  • I think df ends up as a Series in this case? I was hoping to append all logs into one huge file... I am running some analysis on it later and the information is all fragmented. Ideally I would like it out again in the same format as I indicated above... my bad for not including it earlier. – Kate Aug 13 '15 at 14:49
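Following up the comment thread above: pandas.concat on a list of Series does indeed produce one long Series here. A sketch (not part of the original answer) of the same row-by-row loop, but collecting plain dicts and building a DataFrame at the end, so the result keeps one row per recipient:

```python
import pandas

# Inline stand-in for the log.txt sample data used in the answer.
readin = pandas.DataFrame({
    'date-time': ['1-1-2015', '2-2-2015'],
    'sender-address': ['me@my.com', 'me3@my.com'],
    'recipient-address': ['me1@my.com;me2@my.com', 'me4@my.com;me5@my.com'],
})

rows = []
for _, row in readin.iterrows():
    for recip in row['recipient-address'].split(';'):
        rows.append({'date-time': row['date-time'],
                     'sender-address': row['sender-address'],
                     'recipient-address': recip})

# Building a DataFrame from the list of dicts keeps the tabular shape,
# unlike pandas.concat on a list of Series, which flattens everything.
df = pandas.DataFrame(rows)
print(df)
```

df can then be written back out in the original format with df.to_csv, or appended across log files before analysis.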

I updated the code below to add untested code for passing the first five lines of the logfile straight through to the output.

If all you're doing with the csv logfiles is this transformation, then a possibly faster approach - although not without some significant potential disadvantages - would be to avoid csv reader/pandas and process the csv logfiles at a text level, maybe something like this:

LOGfiles = ('log.txt','log.txt')
outfile = open( 'result.csv',"wt")
for LOGfile in LOGfiles[:1]:
    linenumber=0
    for line in open(LOGfile,"rt"):
        linenumber += 1
        if linenumber < 6:
            outfile.write(line)
        else:
            line = line.strip()
            fields = line.split(",")
            recipients = fields[2].split(';')
            for recip in recipients:
                outfile.write(','.join([fields[0],fields[1],recip])+'\n')

Some of the disadvantages of this approach are:

  • The field for recipient-address is hardcoded, as are the fields for output
  • It happens to pass-through the header line - you may want to make this more robust e.g. by reading the header line before getting into the expansion code
  • It assumes that the csv field separator is a hardcoded comma (,), and so won't cope if any of the fields in the csv file contain a comma
  • It probably works OK with ASCII csv files, but may barf on extended character sets (UTF, etc.), which are very commonly found these days
  • It will likely be harder to maintain than the pandas approach

Some of these are quite serious and would take a lot of messing about to fix if you were going to code it yourself - particularly the character sets - so personally I find it difficult to strongly recommend this approach; you need to weigh up the pros and cons for your situation.

HTH barny

  • Hi, I need to read the file from the 5th line, and I think the encoding might be wlatin – Kate Aug 14 '15 at 07:30
  • I added a simple line counter (untested - try it out yourself) as an example of passing through the header lines - maybe you could extend this yourself to meet whatever your needs are. The character set thing is likely to be easiest to manage if you use Python 3, but if you can convince yourself that the code above will always work with your logfiles then maybe you can use this code. Another alternative is to use the csv reader, which may be more robust with different character sets, but handle its output directly rather than going through pandas. No, I don't have an example. – DisappointedByUnaccountableMod Aug 14 '15 at 14:27