5

I have a text file containing tabular data. What I need to do is automate the task of writing to a new text file that is comma delimited instead of space delimited, extract a few columns from existing data, reorder the columns.

This is a snippet of the first 4 lines of the original data:

Number of rows: 8542
 Algorithm  |Date   |Time   |Longitude  |Latitude   |Country    
 1  2000-01-03  215926.688  -0.262  35.813  Algeria 
 1  2000-01-03  215926.828  -0.284  35.817  Algeria

Here is what I want in the end:

Longitude,Latitude,Country,Date,Time
-0.262,35.813,Algeria,2000-01-03,215926.688

Any tips on how to approach this?

Jeff Mercado
  • 129,526
  • 32
  • 251
  • 272
MLau
  • 61
  • 1
  • 1
  • 3

4 Answers4

7

I guess the file is separated by tabs, not spaces.

If so, you can try something like:

input_file = open('some_tab_separated_file.txt', 'r')
output_file = open('some_tab_separated_file.csv', 'w')
input_file.readline() # skip first line 
for line in input_file:
    (a, date, time, lon, lat, country) = line.strip().split('\t')
    output_file.write(','.join([lon, lat, country, date, time]) + '\n')
input_file.close()
output_file.close()

This code is untested, any bug is left for you as exercise.

Paulo Scardine
  • 73,447
  • 11
  • 124
  • 153
  • Hi, it looks like this would reorder the field header names. Will it also reorder the data in the columns or is there another step? – MLau Jan 14 '12 at 00:50
  • providing its tab-separated, should reorder header and data inside a `for line in file` loop. – Paulo Scardine Jan 14 '12 at 00:55
  • Still working on your method. Getting this: "failed to run script - syntax error - can't assign to literal" Changed all the field names to the correct name. – MLau Jan 14 '12 at 01:28
  • Could you join me at http://chat.stackoverflow.com/rooms/6712/paulos ?? I will help you. – Paulo Scardine Jan 14 '12 at 01:37
  • Paulo - I have to have a "reputation" of 20 to use the chat room. the reply function is disabled. Am at 13 now – MLau Jan 14 '12 at 01:42
  • ok, I'm having trouble with the 5th line. I'm replacing the field names that you have provided in the above code, but get a syntax error. Here is the line: (Algorithm, |Date, |Time, |Longitude, |Latitude, |Country, |Continent, |Region, |Admin, |GlobCover, |Biomass, |Koeppen) = line.split('\t') – MLau Jan 14 '12 at 01:55
  • That did it! Your code did the trick. Thank You!! -so what is the "|" character for in this example? – MLau Jan 14 '12 at 01:58
  • I think "|" in python is the bitwise "or" operator, so you cant use it in variable names. – Paulo Scardine Jan 14 '12 at 02:08
  • That makes sense. Thanks for sharing your knowledge, this has been a great experience and resource. – MLau Jan 14 '12 at 02:13
  • @MLau: I hope you increase your reputation soon in order to be able to use the chat. Good luck! – Paulo Scardine Jan 14 '12 at 02:18
4

You could use the csv module and a reader with the ' ' delimiter to read your data in, and use the a writer from the same module (with a comma delimiter) to produce the output.

In fact, the first example in the csv module documentation uses delimiter=' '.

You can use a DictReader/DictWriter and specify the order of the columns in its constructor (fieldnames list: different for reader/writer if you want to re-order) to output the entries in the order you wish.

(You may need to skip/ignore your first two rows when producing the output.)

EDIT:

Here is an example for dealing with multi-word country names:

import cStringIO
import csv

f = cStringIO.StringIO("""A B C
1 2 Costa Rica
3 4 Democratic Republic of the Congo
""")

r = csv.DictReader(f, delimiter=' ', restkey='rest')
for row in r:
    if row.get('rest'):
        row['C'] += " %s" % (" ".join(row['rest']))
    print 'A: %s, B: %s, C: %s' % (row['A'], row['B'], row['C'])

Use the restkey= and concatenate the dict entry for that value, which is a list of what's left over (here restkey='rest'). This prints:

A: 1, B: 2, C: Costa Rica
A: 3, B: 4, C: Democratic Republic of the Congo
Bruno
  • 119,590
  • 31
  • 270
  • 376
  • What happens with a country like "Costa Rica" if you use `delimiter=' '`? – Paulo Scardine Jan 14 '12 at 01:40
  • @PauloScardine, you should be able to use the `restkey` for this (see update). Of course, in the initial example, this will require a bit of clean up with the initial two row, but even when using a plain (non `DictReader`), it should be possible to obtain what's left and join it together. – Bruno Jan 14 '12 at 02:01
0

I guess the important idea is that you have to use '\t' as the delimiter @Paulo Scardine.

I just wanted to add that pandas is a very good library for handling column data.

>>> src = 'path/to/file'
>>> dest = 'path/to/dest_csv'
>>> column_names = ['names', 'of', 'columns']

>>> df = pd.read_csv(src, delimiter='\t', names=column_names)

# Do something in pandas if you need to

>>> df.to_csv(dest, index=False, sep = ';')
Nima Mousavi
  • 1,601
  • 2
  • 21
  • 30
0

str.split() without any arguments will split by any length of whitespace. operator.itemgetter() takes multiple arguments, and will return a tuple.

Ignacio Vazquez-Abrams
  • 776,304
  • 153
  • 1,341
  • 1,358