
I have many large tab-separated files saved as .txt, each of which has seven columns with the following headers:

#column_titles = ["col1", "col2", "col3", "col4", "col5", "col6", "text"]    

I would simply like to extract the final column, named text, and save it to a new file, with each row being a row from the original file; the values are all strings.

EDIT: This is not a duplicate of a similar problem, as splitlines() was not necessary in my case; only the order of things needed to be improved.

Based on several other posts, here is my current attempt:

import csv

# File names: to read in from and read out to
input_file = "tester_2014-10-30_til_2014-08-01.txt"
output_file = input_file + "-SA_input.txt"

## ==================== ##
##  Using module 'csv'  ##
## ==================== ##
with open(input_file) as to_read:
    reader = csv.reader(to_read, delimiter = "\t")

    desired_column = [6]        # text column

    for row in reader:
    myColumn = list(row[i] for i in desired_column)

with open(output_file, "wb") as tmp_file:
    writer = csv.writer(tmp_file)

for row in myColumn:
    writer.writerow(row)

What I am getting is simply the text field from the 2624th row of my input file, with each of the letters in that string separated out:

H,o,w, ,t,h,e, ,t,e,a,m, ,d,i,d, ,T,h,u,r,s,d,a,y, ,-, ,s,e,e, ,h,e,r,e

I know very little in the world of programming is random, but this is definitely strange!

This post is pretty close to what I need, but misses the writing and saving parts, which I am also not sure about.

I have looked into using the pandas toolbox (as per one of the links above), but I am unable to due to my Python installation, so please only suggest solutions using csv or other built-in modules!

n1k31t4
  • Possible duplicate of [Tab-delimited file using csv.reader not delimiting where I expect it to](http://stackoverflow.com/questions/12136850/tab-delimited-file-using-csv-reader-not-delimiting-where-i-expect-it-to) – ergonaut Oct 19 '15 at 14:43

2 Answers


You must process the file one row at a time: read, parse and write.

import csv

# File names: to read in from and read out to
input_file = "tester_2014-10-30_til_2014-08-01.txt"
output_file = input_file + "-SA_input.txt"

## ==================== ##
##  Using module 'csv'  ##
## ==================== ##
with open(input_file) as to_read:
    with open(output_file, "wb") as tmp_file:
        reader = csv.reader(to_read, delimiter = "\t")
        writer = csv.writer(tmp_file)

        desired_column = [6]        # text column

        for row in reader:     # read one row at a time
            myColumn = list(row[i] for i in desired_column)   # build the output row (process)
            writer.writerow(myColumn) # write it
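
Note that opening the output file with "wb" is the Python 2 convention; under Python 3 the csv module needs text mode, so a minimal sketch of the same approach (assuming Python 3) would open both files with newline='':

import csv

input_file = "tester_2014-10-30_til_2014-08-01.txt"
output_file = input_file + "-SA_input.txt"

# Python 3: text mode with newline='' so the csv module controls line endings itself
with open(input_file, newline='') as to_read, \
     open(output_file, "w", newline='') as tmp_file:
    reader = csv.reader(to_read, delimiter="\t")
    writer = csv.writer(tmp_file)

    for row in reader:                # read one row at a time
        writer.writerow([row[6]])     # keep only the 'text' column and write it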
Serge Ballesta
  • Merci beaucoup! Worked without my having to change a thing. I had all the right ingredients there, just the wrong recipe in my head. – n1k31t4 Oct 19 '15 at 15:15
  • Would you mind editing your code to highlight for me exactly where the "read, parse, write" steps occur? – n1k31t4 Oct 19 '15 at 15:23

I would go for this simple solution:

    text_strings = []  # empty list to store the last column's text
    with open('my_file') as ff:
        ss = ff.readlines()  # read all lines into a list of strings

    for s in ss:
        text_strings.append(s.rstrip('\n').split('\t')[-1])  # last column (strip the newline first)

    with open('out_file', 'w') as outf:
        outf.write('\n'.join(text_strings))  # write everything to the output file

Using a list comprehension, you can translate the last columns of the ss strings into text_strings faster and in one line:

    text_strings = [k.rstrip('\n').split("\t")[-1] for k in ss]

There are other simplifications possible; you get the idea.

The problem in your code appears at these two lines:

        for row in reader:
        myColumn = list(row[i] for i in desired_column)

First, there is no indentation, so nothing happens inside the loop. Actually, on my computer it throws an error, so it is possibly just a typo in the post. But with the indentation fixed, at each step of the for-loop you overwrite the myColumn value with the one coming from the new row, hence at the end you are left with only the text from the last row of the file. Second, writer.writerow(row) is then given that bare string; writerow expects a sequence of fields, and iterating over a string yields its individual characters:

    In [5]: s = 'AAAA'

    In [6]: list(s)
    Out[6]: ['A', 'A', 'A', 'A']

which is exactly what you see in the output.
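
To see how this produces the comma-separated letters, here is a minimal sketch (writing to sys.stdout purely for demonstration): passing a bare string to `writerow` makes the csv module treat every character as a separate field, while wrapping it in a list writes it as one field:

    import csv
    import sys

    writer = csv.writer(sys.stdout)

    # a bare string is iterated character by character -> one field per character
    writer.writerow("How the team did")    # prints: H,o,w, ,t,h,e, ,t,e,a,m, ,d,i,d

    # a one-element list is written as a single field
    writer.writerow(["How the team did"])  # prints: How the team did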

pausag
  • Hmm, with that sample input it could work, but a csv file can be much more complex. How would you process commas (`,`) or newlines escaped inside double quotes? The csv module exists because it correctly deals with all those corner cases (see the sketch after these comments). – Serge Ballesta Oct 19 '15 at 15:05
  • @pausag - That indentation error was from my bad copy/paste of my code - it isn't in my file. Your example isn't working for me, throwing the error: `**ff.write('\n'.join(text_strings)) # write everything to output file ValueError: I/O operation on closed file**` – n1k31t4 Oct 19 '15 at 15:11
  • @SergeBallesta Good point, and for csv input one should of course account for these. The problem however states that there are "large txt tab separated files", so I do not see the point of using csv here. BTW, I also did not notice the tab separator in the problem. I have edited the answer. – pausag Oct 19 '15 at 15:14
  • @DexterMorgan, you are right, this is my typo; there should be `outf.write(...)`, not `ff`. Edited the post. – pausag Oct 19 '15 at 15:16
  • I did edit your comma to a `\t`, but I still got the same error as before. The answer from Serge has been accepted, but thanks for teaching me something new about `readlines()`, nevertheless! – n1k31t4 Oct 19 '15 at 15:18
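
To illustrate the corner case Serge Ballesta mentions in the comments, here is a minimal sketch (the two sample lines are made up): a quoted field containing an embedded tab and newline is split apart by a naive `split('\t')`, while `csv.reader` reassembles it into a single field:

    import csv

    # two physical lines that form ONE logical row, because the quoted
    # field contains an embedded tab and an embedded newline
    lines = ['col1\t"text with\ta tab and\n', 'a newline"\n']

    # naive splitting treats them as two broken rows
    print([l.rstrip('\n').split('\t') for l in lines])
    # [['col1', '"text with', 'a tab and'], ['a newline"']]

    # csv.reader keeps the quoted field together as one value
    print(list(csv.reader(lines, delimiter='\t')))
    # [['col1', 'text with\ta tab and\na newline']]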