Parsing specific columns from a dataset in python

Question

I have a dataset with multiple columns and I am only interested in analyzing the data from six of the columns. It is in a txt file, and I want to load the file and pull out the following columns (0, 1, 2, 4, 6, 7) with the headings (time, mode, event, xcoord, ycoord, phi). There are ten columns total, Here is an example of what the data looks like:

1385940076332   3   M   subject_avatar  -30.000000  1.000000    -59.028107  180.000000  0.000000    0.000000
1385940076336   2   M   subject_avatar  -30.000000  1.000000    -59.028107  180.000000  0.000000    0.000000
1385940076339   3   M   subject_avatar  -30.000000  1.000000    -59.028107  180.000000  0.000000    0.000000
1385940076342   3   M   subject_avatar  -30.000000  1.000000    -59.028107  180.000000  0.000000    0.000000
1385940076346   3   M   subject_avatar  -30.000000  1.000000    -59.028107  180.000000  0.000000    0.000000
1385940076350   2   M   subject_avatar  -30.000000  1.000000    -59.028107  180.000000  0.000000    0.000000
1385940076353   3   M   subject_avatar  -30.000000  1.000000    -59.028107  180.000000  0.000000    0.000000
1385940076356   3   M   subject_avatar  -30.000000  1.000000    -59.028107  180.000000  0.000000    0.000000

When I use the following code to parse the data into columns, it only appears to count the data- but I would like to be able to list the data for further analysis. Here is the code I am using from @alko:

import pandas as pd
df = pd.read_csv('filtered.txt', header=None, false_values=None, sep='\s+')[[0, 1, 2, 4, 6, 7]]
df.columns = ['time', 'mode', 'event', 'xcoord', 'ycoord', 'phi']
print df

Here is what that code returns:

class 'pandas.core.frame.DataFrame'
Int64Index: 115534 entries, 0 to 115533
Data columns (total 6 columns): 
time      115534  non-null values
mode      115534  non-null values
event     115534  non-null values
xcoord    115534  non-null values
ycoord    115534  non-null values
phi       115534  non-null values
dtypes: float64(3), int64(2), object(1)

So the goal is to pull out these 6 columns from the 10 original, label them, and list them.

You have to check delimiter. Then `str.split(delimiter)`. – Puffin GDI Dec 27 '13 at 15:10 — Puffin GDI, Dec 27 '13 at 15:10
The delimiter is one space in between each column – Julia Dec 27 '13 at 15:15 — Julia, Dec 27 '13 at 15:15

alko · Accepted Answer · 2013-12-27T15:46:17.313

2

You can use pandas' read_csv parser:

import pandas as pd
from StringIO import StringIO
s = """1385940076332   3   M   subject_avatar  -30.000000  1.000000    -59.028107  180.000000  0.000000    0.000000
1385940076336   2   M   subject_avatar  -30.000000  1.000000    -59.028107  180.000000  0.000000    0.000000
1385940076339   3   M   subject_avatar  -30.000000  1.000000    -59.028107  180.000000  0.000000    0.000000
1385940076342   3   M   subject_avatar  -30.000000  1.000000    -59.028107  180.000000  0.000000    0.000000
1385940076346   3   M   subject_avatar  -30.000000  1.000000    -59.028107  180.000000  0.000000    0.000000
1385940076350   2   M   subject_avatar  -30.000000  1.000000    -59.028107  180.000000  0.000000    0.000000
1385940076353   3   M   subject_avatar  -30.000000  1.000000    -59.028107  180.000000  0.000000    0.000000
1385940076356   3   M   subject_avatar  -30.000000  1.000000    -59.028107  180.000000  0.# 000000    0.000000"""

df = pd.read_csv(StringIO(s),header=None, sep='\s+')[[0, 2, 3, 4, 6, 7]]
df.columns = ['time', 'mode', 'event', 'xcoord', 'ycoord', 'phi']
print df
#             time mode           event  xcoord     ycoord  phi
# 0  1385940076332    M  subject_avatar     -30 -59.028107  180
# 1  1385940076336    M  subject_avatar     -30 -59.028107  180
# 2  1385940076339    M  subject_avatar     -30 -59.028107  180
# 3  1385940076342    M  subject_avatar     -30 -59.028107  180
# 4  1385940076346    M  subject_avatar     -30 -59.028107  180
# 5  1385940076350    M  subject_avatar     -30 -59.028107  180
# 6  1385940076353    M  subject_avatar     -30 -59.028107  180
# 7  1385940076356    M  subject_avatar     -30 -59.028107  180

Note, that I corrected columns indices, as it seems that ones provided by You in the question are not correct.

edited Dec 27 '13 at 15:46

answered Dec 27 '13 at 15:19

alko

46,136
12
94
102

1

Thank you, that worked! How can I apply this piece of code to my entire dataset instead of just the piece that I copied and pasted? – Julia Dec 27 '13 at 15:41
1

@Julia You should be able to provide `'/Users/Lab/Desktop/vr.txt'` instead of `StringIO` object. Don't forget to handle header if exists and skip rows if needed. I updated link to point to read_csv docs, you'll find details there. – alko Dec 27 '13 at 15:45
Thanks, I think that would have worked, except my dataset is not uniform, there are lines with fewer than 10 columns interspersed throughout and ideally I want to distinguish the data before those lines from the data that comes after. I will ask a separate question to address this. Thanks for your help! – Julia Dec 27 '13 at 16:35
I edited my original question to ask how I can view the data now that it is separated into the 6 different columns. I have tried to use 'print df.values', but that is not giving the complete data. – Julia Jan 06 '14 at 20:04
@Julia You'd better don't edit original question but ask a new one (it's free :) ), as this invalidates already given answers. Also it helps keep one question per topic, wich makes this site more useful for further reference. – alko Jan 06 '14 at 21:07
Thanks for the tip! (I am new to the site) – Julia Jan 06 '14 at 21:12
1

@Julia you're welcome! For your second question, your dataframe is loaded but is too big to be printed out, that's why info only is printed. Consult for example http://stackoverflow.com/a/15006495/1265154 – alko Jan 06 '14 at 21:15

Parsing specific columns from a dataset in python

1 Answers1

Linked