3

I have a file in the following format:

10000 
2
2
2
2
0.00
0.00
0 1

0.00
0.01
0 1
...

I want to create a dataframe from this file (skipping the first 5 lines) like this:

x1   x2    y1  y2
0.00 0.00  0   1
0.00 0.01  0   1

So the lines are converted to columns (where each third line is also split into two columns, y1 and y2).

In R I did this as follows:

df = as.data.frame(scan(".../test.txt", what=list(x1=0, x2=0, y1=0, y2=0), skip=5))

I am looking for a Python alternative (pandas?) to this scan(file, what=list(...)) function. Does one exist, or do I have to write a longer script?

Andrey Lebedev
2xu

3 Answers

3

You can skip the first 5 lines, then take the remaining lines in groups of 4 to build a Python list, and put that into pandas as a start... I wouldn't be surprised if pandas offered something better though:

from itertools import islice, zip_longest

with open('input') as fin:
    # Skip header(s) at start
    after5 = islice(fin, 5, None)
    # Take remaining data and group it into groups of 4 lines each... The
    # first 2 are float data, the 3rd is two integers together, and the 4th
    # is the blank line between groups... We use zip_longest to ensure we
    # always have 4 items (padded with None if needs be)...
    for lines in zip_longest(*[iter(after5)] * 4):
        # Convert first two lines to float, and take 3rd line, split it and
        # convert to integers
        print(list(map(float, lines[:2])) + list(map(int, lines[2].split())))

#[0.0, 0.0, 0, 1]
#[0.0, 0.01, 0, 1]
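
The grouping approach above stops at printing each row. A minimal Python 3 sketch of the remaining step, collecting the rows into a DataFrame, might look like the following. The helper name `rows_from_lines` and the in-memory `sample` text are assumptions made for illustration; the sample mirrors the layout shown in the question (5 header lines, then groups of two floats, one line with two integers, and a blank line).

```python
from io import StringIO
from itertools import islice, zip_longest

import pandas as pd


def rows_from_lines(lines, skip=5, group_size=4):
    """Yield [x1, x2, y1, y2] for each group of lines after the header.

    Hypothetical helper: assumes every group has two float lines, one line
    holding two integers, and a blank separator line (optional at the end).
    """
    body = islice(lines, skip, None)
    for group in zip_longest(*[iter(body)] * group_size):
        # group[3] is the blank separator (or None at end of input); ignored
        x1, x2 = float(group[0]), float(group[1])
        y1, y2 = (int(v) for v in group[2].split())
        yield [x1, x2, y1, y2]


# Stand-in for the real file, mirroring the sample in the question
sample = StringIO("10000\n2\n2\n2\n2\n0.00\n0.00\n0 1\n\n0.00\n0.01\n0 1\n")

rows = list(rows_from_lines(sample))
df = pd.DataFrame(rows, columns=["x1", "x2", "y1", "y2"])
print(df)
```

With a real file, the `StringIO` object would be replaced by the handle from `open(...)`. The sketch raises an error on a malformed group rather than skipping it, which seems the safer default for this kind of fixed layout.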
Jon Clements
  • Tnx Jon! If pandas (or something else) has a more concise function like scan() in R, would be awesome. – 2xu Dec 03 '13 at 13:17
  • 1
    +1 nice one @JonClements, could you explain it a bit? – Roman Pekar Dec 03 '13 at 13:22
  • @2xu I don't think it does... but there's people out there with way more pandas experience than I... For non trivial pre-processing - you generally end up writing a custom function that yields valid rows for use in a `DataFrame` anyway... – Jon Clements Dec 03 '13 at 13:22
  • @RomanPekar added a bit - hope it helps - if not, let me know – Jon Clements Dec 03 '13 at 13:26
  • @JonClements and why do you need `iter` around `after5`? – Roman Pekar Dec 03 '13 at 13:31
  • @RomanPekar probably best to point to [this question and answers](http://stackoverflow.com/questions/312443/how-do-you-split-a-list-into-evenly-sized-chunks-in-python) for that :) – Jon Clements Dec 03 '13 at 13:59
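
On the `iter` question raised in the comments: the grouping idiom relies on the *same* iterator object appearing several times in the argument list, so each output tuple consumes consecutive items. The sketch below uses a made-up list `values` to show the difference. Since an `islice` object is already an iterator, calling `iter` on it returns the same object, so the call is harmless there and only becomes essential for lists and other re-iterable containers.

```python
from itertools import zip_longest

values = ["a", "b", "c", "d", "e", "f", "g"]  # illustrative data only

# One iterator referenced 3 times: each tuple pulls 3 consecutive items.
it = iter(values)
groups = list(zip(it, it, it))  # equivalent to zip(*[iter(values)] * 3)

# zip stops at the first exhausted input, so the leftover "g" is dropped;
# zip_longest keeps the incomplete tail and pads it with None.
padded = list(zip_longest(*[iter(values)] * 3))

# Passing the list itself 3 times creates 3 independent iterators, so each
# item is paired with itself instead of with its neighbours.
not_grouped = list(zip(values, values, values))[:2]

print(groups)
print(padded)
print(not_grouped)
```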
0

As far as I can see, none of the options listed at http://pandas.pydata.org/pandas-docs/stable/io.html will organize your DataFrame the way you want.

But you can achieve it easily:

import re                               # import re
import pandas as pd                     # import pandas

with open('YourDataFile.txt') as f:     # read the whole file
    lines = f.read()
elems = re.split(r'\s+', lines.strip())[5:]  # split on whitespace runs and exclude the first 5
grouped = list(zip(*[iter(elems)] * 4))      # group them 4 by 4
df = pd.DataFrame(grouped)              # construct DataFrame (values are still strings)
df.columns = ['x1', 'x2', 'y1', 'y2']   # column names

It's not concise, it's not elegant, but it's clear what it does...
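
One detail worth checking when tokenizing this way: splitting on a single newline or space produces an empty string wherever two separators are adjacent (a blank line, a trailing space), and those empty strings shift the groups of 4. Splitting on runs of whitespace avoids that. A small sketch, using a made-up `text` fragment shaped like the data in the question:

```python
import re

# Illustrative fragment: two groups separated by a blank line
text = "2\n0.00\n0.00\n0 1\n\n0.00\n0.01\n0 1\n"

single_sep = re.split(r"\n| ", text)  # one split per separator character
any_run = text.split()                # runs of whitespace collapse into one

print(single_sep)  # contains '' for the blank line and the trailing newline
print(any_run)     # only the actual tokens
```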

Giupo
  • Nice one. Had to look up the *iter(elems)*4 part, but found it. And I'm not looking for elegance, just brute force :-) – 2xu Dec 03 '13 at 15:39
  • And there was a typo too (elem instead of elems). Glad you understood it ;) – Giupo Dec 03 '13 at 19:04
0

OK, here's how I did it (it is in fact a combo of Jon's & Giupo's answer, tnx guys!):

import pandas as pd

with open('myfile.txt') as f:
    data = f.read().split()[5:]          # split on any whitespace, drop the 5 header values
grouped = list(zip(*[iter(data)] * 4))   # rows of 4 values each
df = pd.DataFrame(grouped)
df.columns = ['x1', 'x2', 'y1', 'y2']
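
Note that the values in this DataFrame are still strings. If numeric columns are wanted (as R's scan with what=0 would give), one possible variant is to parse the tokens with NumPy and reshape them into rows of 4. This is a sketch under the assumption that the number of tokens after the header is an exact multiple of 4; the in-memory `sample` stands in for the real file.

```python
from io import StringIO

import numpy as np
import pandas as pd

# Stand-in for the real file, mirroring the sample in the question
sample = StringIO("10000\n2\n2\n2\n2\n0.00\n0.00\n0 1\n\n0.00\n0.01\n0 1\n")

tokens = sample.read().split()[5:]      # drop the 5 header values
values = np.array(tokens, dtype=float)  # parse every token as a float
# reshape raises ValueError if the token count is not a multiple of 4
df = pd.DataFrame(values.reshape(-1, 4), columns=["x1", "x2", "y1", "y2"])
df = df.astype({"y1": int, "y2": int})  # optional: integer y columns
print(df.dtypes)
```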
2xu