How to Split into Columns

Question

I have a file containing two datasets, which I'd like to read into Python as two columns.

The data is in the form:

xxx yyy    xxx yyy   xxx yyy

and so on, so I understand that I need to somehow split it up. I'm new to Python (and relatively new to programming), so I've struggled a bit so far. At the moment I've tried to use:

def read(file):

    column1=[]
    column2=[]
    readfile = open(file, 'r')
    a = (readfile.read())
    readfile.close()

How would I go about splitting the read in file into column1 and column2?

Are you saying that one line in the file contains a *series* of data with two columns? Or are there line separators (newlines) between those `xxx yyy` pairs? — Martijn Pieters, Oct 21 '14 at 09:29
This might help you: http://stackoverflow.com/questions/9989334/create-nice-column-output-in-python — the_marcelo_r, Oct 21 '14 at 09:30
There are just spaces between the xxx and yyy, as in they are all in notepad on the same line. — NXW, Oct 21 '14 at 09:30
@theMarceloR OP wants the opposite - to retrieve info from file. — Maroun, Oct 21 '14 at 09:30
@NXW The question is, what's your separator? any spaces or only the bigger spaces between `xxx yyy` pairs? — Maroun, Oct 21 '14 at 09:31
There are 7 spaces between each xxx yyy pair, although I could easily change it to a single space. — NXW, Oct 21 '14 at 09:36
Seems like you'd want something like `col1 = a.split()[::2]` and `col2 = a.split()[1::2]`. — , Oct 21 '14 at 09:36
Is there only one line of data in the file, or are there multiple lines? And if there are multiple lines, do they all contain 3 sets of pairs of data, or can the number of pairs on each line vary? — PM 2Ring, Oct 21 '14 at 10:34
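
The slicing idea from the comments above can be sketched as a small function. This is a minimal sketch, assuming the values themselves contain no whitespace; the name split_pairs is hypothetical and not part of the original question.

```python
def split_pairs(text):
    """Split whitespace-separated text into two lists of alternating tokens."""
    # str.split() with no argument splits on any run of whitespace,
    # so 1 space, 7 spaces and newlines are all treated the same way.
    tokens = text.split()
    if len(tokens) % 2:
        raise ValueError("expected an even number of values")
    # Even positions go to the first column, odd positions to the second.
    return tokens[::2], tokens[1::2]


column1, column2 = split_pairs("x1 y1       x2 y2       x3 y3\n")
print(column1)  # ['x1', 'x2', 'x3']
print(column2)  # ['y1', 'y2', 'y3']
```

Because the split ignores how wide each gap is, this works whether or not the pairs are spread over several lines.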

score 2 · Answer 1 · answered Oct 21 '14 at 10:51

This is quite simple with the Python module pandas. Suppose you have a data file like this:

>cat data.txt
xxx  yyy  xxx  yyy  xxx yyy
xxx yyy    xxx yyy   xxx yyy
xxx yyy  xxx yyy   xxx yyy
xxx yyy    xxx yyy  xxx yyy
xxx yyy    xxx  yyy   xxx yyy

>from pandas import DataFrame
>from pandas import read_csv
>from pandas import concat
>dfin = read_csv("data.txt", header=None, delimiter=r"\s+").add_prefix('X')
> dfin
    X0   X1   X2   X3   X4   X5
0  xxx  yyy  xxx  yyy  xxx  yyy
1  xxx  yyy  xxx  yyy  xxx  yyy
2  xxx  yyy  xxx  yyy  xxx  yyy
3  xxx  yyy  xxx  yyy  xxx  yyy
4  xxx  yyy  xxx  yyy  xxx  yyy
>dfout = DataFrame()
>dfout['X0'] = concat([dfin['X0'], dfin['X2'], dfin['X4']], axis=0, ignore_index=True)
>dfout['X1'] = concat([dfin['X1'], dfin['X3'], dfin['X5']], axis=0, ignore_index=True)
> dfout
      X0   X1
 0   xxx  yyy
 1   xxx  yyy
 2   xxx  yyy
 3   xxx  yyy
 4   xxx  yyy
 5   xxx  yyy
 6   xxx  yyy
 7   xxx  yyy
 8   xxx  yyy
 9   xxx  yyy
 10  xxx  yyy
 11  xxx  yyy
 12  xxx  yyy
 13  xxx  yyy
 14  xxx  yyy

Hope it helps. Best.
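
One detail of the concat approach is the row order: all values from the first pair of every line come first, then the second pair, and so on, rather than the order in which the pairs appear in the file. The sketch below shows both orderings without pandas. It assumes every line holds the same number of pairs; the function name stack_pairs and the column_major flag are hypothetical.

```python
def stack_pairs(text, column_major=True):
    """Return (column1, column2) built from lines of whitespace-separated pairs."""
    rows = [line.split() for line in text.splitlines() if line.strip()]
    if not rows:
        return [], []
    if column_major:
        # Order produced by concatenating X0, X2, X4 ...:
        # first pair of every line, then second pair of every line, ...
        n_pairs = len(rows[0]) // 2
        pairs = [(row[2 * i], row[2 * i + 1]) for i in range(n_pairs) for row in rows]
    else:
        # File order: every pair of line 1, then every pair of line 2, ...
        pairs = [(row[i], row[i + 1]) for row in rows for i in range(0, len(row), 2)]
    return [p[0] for p in pairs], [p[1] for p in pairs]


sample = "a1 b1   a2 b2\na3 b3   a4 b4\n"
print(stack_pairs(sample))                      # (['a1', 'a3', 'a2', 'a4'], ['b1', 'b3', 'b2', 'b4'])
print(stack_pairs(sample, column_major=False))  # (['a1', 'a2', 'a3', 'a4'], ['b1', 'b2', 'b3', 'b4'])
```

If the order of the values matters, the file-order variant is probably the one you want.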

AlvaroAV · Accepted Answer · 2014-10-21T10:05:48.540

Here is a simple example that puts the xxx values in column1 and the yyy values in column2.

Important! The data in your file has to look like this:

xxx yyy    xxx yyy    xxx yyy
That is, 4 spaces between each group (xxx yyy) and 1 space between the two values inside a group.

You can also use a different separator scheme, for example:

xxx,yyy/xxx,yyy/xxx,yyy
Then you only have to change data_separator=',' and column_separator='/'

or

xxx-yyy/xxx-yyy/xxx-yyy
Then you only have to change data_separator='-' and column_separator='/'

def read(file):
    column1 = []
    column2 = []
    data_separator = ' '  # one space separates xxx and yyy
    column_separator = '    '  # 4 spaces separate the groups: xxx yyy    xxx yyy

    with open(file, 'r') as readfile:  # the file is closed automatically
        for line in readfile:  # in case you have more than 1 line
            line = line.rstrip('\n')  # remove the trailing newline
            print(line)

            columns = line.split(column_separator)  # get the data groups
            # columns is now a list like ['xxx yyy', 'xxx yyy', 'xxx yyy']

            for column in columns:
                if not column:
                    continue  # if the group is empty, ignore it
                parts = column.split(data_separator)
                column1.append(parts[0])
                column2.append(parts[1])

    return column1, column2

I have a text file containing xxx yyy    aaa bbb    ttt hhh (4 spaces between the pairs). After calling the function, the result is:

column1 = ['xxx', 'aaa', 'ttt']
column2 = ['yyy', 'bbb', 'hhh']
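
The separator variants described in this answer can be tried on a single line of text by passing the separators as arguments. This is only a sketch: split_columns is a hypothetical helper, and it assumes every non-empty group contains exactly one data separator.

```python
def split_columns(line, data_separator=" ", column_separator="    "):
    """Split one line into two lists using the given separators."""
    column1, column2 = [], []
    for group in line.split(column_separator):
        if not group:
            continue  # skip empty groups, e.g. from a trailing separator
        # Assumes exactly two values per group; anything else raises ValueError.
        first, second = group.split(data_separator)
        column1.append(first)
        column2.append(second)
    return column1, column2


print(split_columns("xxx yyy    aaa bbb    ttt hhh"))
print(split_columns("xxx,yyy/aaa,bbb/ttt,hhh", data_separator=",", column_separator="/"))
print(split_columns("xxx-yyy/aaa-bbb/ttt-hhh", data_separator="-", column_separator="/"))
```

All three calls should give (['xxx', 'aaa', 'ttt'], ['yyy', 'bbb', 'hhh']).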

score -2 · Answer 3 · answered Oct 21 '14 at 09:53

In your example the second gap between the datasets is 3 spaces wide, so I assume the datasets are separated by a minimum of two spaces.

# Reading a file does not seem to be your problem ;)
# This also works with 3, 4 or any larger number of spaces.
data = 'xxx yyy    xxx yyy             xxx yyy'

# Collapse every run of more than two spaces down to exactly two
while '   ' in data:
    data = data.replace('   ', '  ')

# Split the datasets, which are now separated by two spaces
data = data.split('  ')

# Split each dataset into its two values
data = [x.split(' ') for x in data]

# Transpose for easier (requested?) access
column1, column2 = zip(*data)

print(column1)
print(column2)

The output is:

('xxx', 'xxx', 'xxx')
('yyy', 'yyy', 'yyy')

hope it helps you :)
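
The same "two or more spaces separate the datasets" rule can also be written with a regular expression instead of the replace loop. This is a sketch of an alternative, not the answer's own method; split_on_wide_gaps is a hypothetical name, and the input is assumed to be non-empty with exactly one space inside each pair.

```python
import re


def split_on_wide_gaps(data):
    """Split pairs separated by 2+ spaces; values in a pair are 1 space apart."""
    # strip() removes leading/trailing whitespace that would create empty groups.
    groups = re.split(r" {2,}", data.strip())
    pairs = [group.split(" ") for group in groups]
    # zip(*pairs) transposes [[x, y], [x, y], ...] into (x, x, ...), (y, y, ...).
    column1, column2 = zip(*pairs)
    return list(column1), list(column2)


print(split_on_wide_gaps("xxx yyy    aaa bbb             ttt hhh"))
# (['xxx', 'aaa', 'ttt'], ['yyy', 'bbb', 'hhh'])
```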
