0

I have a piped text file whose structure looks like this:

seatle|washington|13|35|1|NW|2005-05-10                 
sanfransisco|california|13|31|1|W|2005-05-10            
chicago|illinois|10|33|1|C|2005-05-10               
newyorkcity|newyork|9|32|2|E|2005-05-10                 
DC|DC|9|30|4|E|2005-05-10               
miami|florida|9|20|4|SE|2005-05-10              
dallas|texas|12|22|2|S|2005-05-10    

In the above sample, the 5th column is a category number. There can be n number of categories (in above sample there are 3 categories - 1,2 and 4 but the number can vary) in the dataset. Now what I want is to read the data from this input file pivot the column 3 and 4 based on category in column 5 and create another output file whose structure should look like this:

column 1|column 2|column 6|column 7|column3_catgory1|column3_catgory2|column3_catgory3|....|column4_catgory1|column4_catgory2|column4_catgory3|...          

Thus the output file should have data from the input file column 1,column 2,column 6,column 7 and then data of column 3 mentioned in the respective output column based on the value of column 5 followed by data of column 4 mentioned in the respective output column based on the value of column 5. Thus for the above sample input the output file should look like this:

column 1    |column 2  |column 6|column 7  |column3_1|column3_2|column3_4|column4_1|column4_2|column4_4
seatle      |washington|NW      |2005-05-10|13       |         |         |35       |         |                       
sanfransisco|california|W       |2005-05-10|13       |         |         |31       |         |            
chicago     |illinois  |C       |2005-05-10|10       |         |         |33       |         |               
newyorkcity |newyork   |E       |2005-05-10|         |9        |         |         |32       |                 
DC          |DC        |E       |2005-05-10|         |         |9        |         |         |30

As you can see that since there are 3 categories so column 3 and 4 of input are pivoted into three different columns in the output with there value mentioned in the appropriate column based on the category.For example, for row with seatle since category was 1 so the values of column 3 and column 4 are mentioned in column3_1 and column4_1 respectively with other column 3 category values(column3_2 & column3_4) and column 4 category values (column4_2 & column4_4) empty.

Now this is the code that I have written to read the piped file and to determine how many categories are there but I am unable to proceed ahead of this.

    category = []

    read = open('file1.txt', 'r')

    for line in read:
        fields = line.split('|')
        if fields[4] not in category:
            category.append(fields[4])
    category = map(int, category) #converts the category value to int for sorting
    category.sort()

Can anyone help me how to proceed ahead of this and pivot the values appropriately?

NOTE: even though there is no header in the input file, I want to add the header in output file like I have shown in my above sample output structure. Also in the output I don't want any extra space between between two column values as I have shown in the sample output data for the sample input.I have deliberately added those spaces to make it easier to view what I want.

Jason Donnald
  • 2,256
  • 9
  • 36
  • 49
  • I would check out ```pandas```. You can use ```pandas.read_tables``` to load the output file into a ```pandas.DataFrame```, passing the ```sep='|'```. There are lots of options for working with headers etc. Then it should be easy to split the dataframes, add columns, etc. Also, check out http://pandas.pydata.org/pandas-docs/stable/merging.html , you might be able to accomplish what you are trying to do with the ```pandas.merge``` function. You can take a lot of time to write your own database code, but if you are going to do serious database work in python, use pandas – dermen Jul 20 '15 at 19:41
  • @dermen its not a database work. The input is a file and output is also a file. All I want is to pivot the `column 3` and `column 4` of the input data into different columns in the output based on the `column 5` value in input file – Jason Donnald Jul 20 '15 at 19:54
  • I see. Be that as it may, you are working with column,row style data, and for those purposes, ```pandas``` has some handy tools, I will post an example. – dermen Jul 20 '15 at 20:20

2 Answers2

0

I took your example, modified it a little bit, and then added to code to produce the output file. Maybe not the most elegant method, but seems to work using your example data.

category = []
columns = []

read = open('file1.txt', 'r')

for line in read:
    fields = line.split('|')
    columns.append(fields)
    category.append(int(fields[4]))
category = sorted(set(category))
print(category)

for line in columns:
    print(line)

out = open('fileout.txt', 'w')

out.write('column 1|column 2|column 6|column 7|')
for val in category:
    out.write('column3_{0}|'.format(val))
for val in category:
    out.write('column4_{0}|'.format(val))
out.write('\n')

for line in columns:
    out.write('{0}|{1}|{2}|{3}|'.format(line[0], line[1], line[5], line[6].rstrip()))
    for val in category:
        if int(line[4]) == val:
            out.write('{0}|'.format(line[2]))
        else:
            out.write(' |')
    for val in category:
        if int(line[4]) == val:
            out.write('{0}|'.format(line[3]))
        else:
            out.write(' |')                    
    out.write('\n')
jwinterm
  • 334
  • 3
  • 11
  • I haven't tried your version but does it take care of the dynamic number of categories. I mean in my example post though there are 3 categories in column 5 value but that would not be the case every time. So, if there are say 5 categories then after column 1,2,6,7 we should have 5 column of column 3 and 5 columns of column 4. The categories will vary each time it is run – Jason Donnald Jul 20 '15 at 20:32
  • Yes, it should work for however many unique values there are in column 5. I haven't tested it, but I don't see why it wouldn't. – jwinterm Jul 20 '15 at 20:51
0

Here is a simple solution using pandas.DataFrame:

col_names = [ 'col_%d'%(x+1) for x in xrange(7)] # make up some names
df = pandas.read_table( 'test.txt', sep='|' , names=col_names ) # read in the data

df
#          col_1       col_2  col_3  col_4  col_5 col_6       col_7
#0        seatle  washington     13     35      1    NW  2005-05-10
#1  sanfransisco  california     13     31      1     W  2005-05-10
#2       chicago    illinois     10     33      1     C  2005-05-10
#3   newyorkcity     newyork      9     32      2     E  2005-05-10
#4            DC          DC      9     30      4     E  2005-05-10
#5         miami     florida      9     20      4    SE  2005-05-10
#6        dallas       texas     12     22      2     S  2005-05-10

Here we grab the unique values in column 5:

u_vals = pandas.unique( df.col_5 ) # the unique values in column 5
u_dfs = [ df.query( 'col_5 == %d'%x) for x in u_vals ] # the sub-dataframes corresponding to each unique value of column 5

u_dfs[0] # for example
#          col_1       col_2  col_3  col_4  col_5 col_6       col_7
#0        seatle  washington     13     35      1    NW  2005-05-10
#1  sanfransisco  california     13     31      1     W  2005-05-10
#2       chicago    illinois     10     33      1     C  2005-05-10

Next, prepare to merge the dataframes, by renaming columns 3 and 4. and the dropping column 5 (it might throw warnings because we are modifying the original df object, but disregard)

# now rename the column 3 and 4 corresponding to the unique vals found in column 5
for uval, udf in zip( u_vals, u_dfs ):
    udf.rename( columns={'col_3':'col_3_%d'%uval, 'col_4':'col_4_%d'%uval}, inplace=True)
    udf.drop( labels='col_5', axis=1 , inplace=True) # drop column 5

Next, merge the dataframes

result = reduce(lambda left,right: pandas.merge(left,right,on=[ 'col_1', 'col_2', 'col_6', 'col_7'], how='outer'), u_dfs)
result
#              col_1       col_2  col_3_1  col_4_1 col_6       col_7  col_3_2  \
#0        seatle  washington       13       35    NW  2005-05-10      NaN   
#1  sanfransisco  california       13       31     W  2005-05-10      NaN   
#2       chicago    illinois       10       33     C  2005-05-10      NaN   
#3   newyorkcity     newyork      NaN      NaN     E  2005-05-10        9   
#4        dallas       texas      NaN      NaN     S  2005-05-10       12   
#5            DC          DC      NaN      NaN     E  2005-05-10      NaN   
#6         miami     florida      NaN      NaN    SE  2005-05-10      NaN   

#   col_4_2  col_3_4  col_4_4  
#0      NaN      NaN      NaN  
#1      NaN      NaN      NaN  
#2      NaN      NaN      NaN  
#3       32      NaN      NaN  
#4       22      NaN      NaN  
#5      NaN        9       30  
#6      NaN        9       20     

Now we can re-arrange the columns to ensure they are in the desired order and save

cols_out = ['col_1', 'col_2', 'col_6', 'col_7'] + ['col_3_%d'%x for x in u_vals] + ['col_4_%d'%x for x in u_vals] # do this however you like
print ( cols_out) 
#['col_1', 'col_2', 'col_6', 'col_7', 'col_3_1', 'col_3_2', 'col_3_4', 'col_4_1', 'col_4_2', 'col_4_4']
result.to_csv('result.txt', sep='|', na_rep='NA', index=False, columns=cols_out) # save the result as a new txt file, format however you like

Now result.txt looks like

col_1|col_2|col_6|col_7|col_3_1|col_3_2|col_3_4|col_4_1|col_4_2|col_4_4
seatle|washington|NW|2005-05-10|13.0|NA|NA|35.0|NA|NA
sanfransisco|california|W|2005-05-10|13.0|NA|NA|31.0|NA|NA
chicago|illinois|C|2005-05-10|10.0|NA|NA|33.0|NA|NA
newyorkcity|newyork|E|2005-05-10|NA|9.0|NA|NA|32.0|NA
dallas|texas|S|2005-05-10|NA|12.0|NA|NA|22.0|NA
DC|DC|E|2005-05-10|NA|NA|9.0|NA|NA|30.0
miami|florida|SE|2005-05-10|NA|NA|9.0|NA|NA|20.0
Community
  • 1
  • 1
dermen
  • 5,252
  • 4
  • 23
  • 34
  • the `col3_1` and `col4_1` in your output are separated from other column 3 and 5 categories.The order should be col 1,2,6,7 followed by all col 3 categories followed by all col 4 categories – Jason Donnald Jul 20 '15 at 20:22
  • http://stackoverflow.com/questions/13148429/how-to-change-the-order-of-dataframe-columns – dermen Jul 20 '15 at 20:28
  • You can specify the column order in the output file by passing the ```columns=``` argument to ```to_csv``` – dermen Jul 20 '15 at 20:40
  • By the way, if the columns have headers, then why should the order be important. I would think the order is only important if you do not have column names specified in a header. But then again , I do no know what you are using the output for . Anyway, I updated the answer to include your desired column order – dermen Jul 20 '15 at 21:28