
If I have a txt file and it contains something like this:

AGCGTTGATAGTGCAGCCATTGCAAAACTTCACCCTA
AGCGTTGATAGTGCAGCCATTGCAAAACTTCACCCTA
AAGAAACGAGTATCAGTAGGATGCAGACGGTTGATTG   

The lines are separated by "\n".
Now I want to split the sequence into triplets. Is there a way to read the whole txt file as a single line, so that it doesn't give me this:

'CAA', 'TGC', '\nAG', 'CGT', 'TGA', 'TAG', 'TGC', 'AGC',   

I have posted all the code I have at the moment, because none of the given answers seemed to help.
That's the code I'm using to split the whole string into triplets:

fob = open("Exercise.txt", "r")
def read_from_file(filename): 
    raw_txt = filename.read()
    triplets = [raw_txt[i:i+3] for i in range(0, len(raw_txt), 3)]
read_from_file(fob)
colidyre
  • What's symbol count for every line? – Basilevs Oct 04 '15 at 15:10
  • Possible duplicate of [Strip spaces/tabs/newlines - python](http://stackoverflow.com/questions/10711116/strip-spaces-tabs-newlines-python) – Basilevs Oct 04 '15 at 15:12
  • I'm not quite sure what you are asking for –  Oct 04 '15 at 15:13
  • Your example contains lines of length 37. Is this intended? – Basilevs Oct 04 '15 at 15:15
  • Nope, I added this at random; each line actually has 210. –  Oct 04 '15 at 15:24
  • Is the input file small enough to easily fit into RAM, or would it be better if your code can read data as it needs it? If so, you should definitely consider the approach shown in poke's 2nd code block. – PM 2Ring Oct 04 '15 at 15:25
  • So every line can be split to triplets. You can therefore read file line-by-line, removing EOL symbol as needed. – Basilevs Oct 04 '15 at 15:33
  • I edited the code because none of these answers seemed to help when I tried them. Can anyone tell me what I am doing wrong? –  Oct 04 '15 at 15:37
  • I can. Upload code you've actually tried, not the one we know does not work. – Basilevs Oct 04 '15 at 15:42
  • The problem for me is that I have to use a function fn(filename), which gets the txt file as its argument, but I do not know how to use the file inside a function. When I try to do what people in the "Answers" section suggested, I just get errors. –  Oct 04 '15 at 15:49
  • @Donka, you say you've executed some code and got some errors. That won't help people to reproduce your problem and help you solve it. Please read http://stackoverflow.com/help/how-to-ask – Basilevs Oct 15 '15 at 04:35

4 Answers

raw_txt = ''.join(line.rstrip('\n') for line in f.readlines())

Or as @PM 2Ring suggested:

raw_txt = ''.join(f.read().splitlines())
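Since the question asks for this inside a function, here is a minimal sketch of how the splitlines approach might be wrapped up. The name read_triplets is hypothetical, and io.StringIO only stands in for an opened file such as open("Exercise.txt"); the function returns the list instead of discarding it, which the code in the question does not do.

```python
import io


def read_triplets(handle):
    """Return the contents of an open text file as a list of 3-character chunks."""
    # splitlines() splits on line boundaries such as "\n" and "\r\n"
    # and leaves the separators out of the resulting pieces.
    text = "".join(handle.read().splitlines())
    return [text[i:i + 3] for i in range(0, len(text), 3)]


# io.StringIO is a stand-in for a real file object here.
sample = io.StringIO("AGCGTT\nGATAGT\n")
print(read_triplets(sample))  # ['AGC', 'GTT', 'GAT', 'AGT']
```

With a real file this would be called as `with open("Exercise.txt") as f: triplets = read_triplets(f)`.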
Community
xiº
  • If you could _guarantee_ that the file ends in a newline you could just do `''.join([line[:-1] for line in f.readlines()])`. OTOH, my preference is for `''.join(f.read().splitlines())`. – PM 2Ring Oct 04 '15 at 15:18

You don't need to call readlines; just iterate over the file object, rstripping each line:

with open("test.txt") as f:
    line = "".join([line.rstrip() for line in f])

Or combine it with map:

with open("test.txt") as f:
    line = "".join(list(map(str.rstrip,f)))

Called with no arguments, rstrip removes all trailing whitespace, so it takes care of whatever line endings the file uses (and of any trailing spaces).
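A small sketch of the difference between rstrip() and rstrip("\n"), using a made-up line with trailing spaces and a Windows-style line ending:

```python
# A hypothetical line with trailing spaces and a "\r\n" line ending.
line = "AAGAAACGAG   \r\n"

stripped_all = line.rstrip()          # removes every trailing whitespace character
stripped_newline = line.rstrip("\n")  # removes only trailing "\n" characters

print(repr(stripped_all))      # 'AAGAAACGAG'
print(repr(stripped_newline))  # 'AAGAAACGAG   \r'
```

For sequence data the argument-free form is usually the safer choice, because a stray space or "\r" would otherwise end up inside a triplet.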

If you want the slices just call iter on the joined string and zip:

with open("test.txt") as f:
    # join the stripped lines, then make a single iterator over the characters
    line = iter("".join(map(str.rstrip, f)))
for sli in zip(line, line, line):
    print("".join(sli))

If the length of your data is not a multiple of 3 and you do not want to lose the leftover characters, you could use itertools.zip_longest:

from itertools import zip_longest
with open("test.txt") as f:
    line = iter("".join(list(map(str.rstrip, f))))
    for sli in zip_longest(line,line,line, fillvalue=""):
        print("".join(sli))

On your sample input both will output:

AGC
GTT
GAT
AGT
GCA
GCC
ATT
GCA
AAA
CTT
CAC
CCT
AAG
CGT
TGA
TAG
TGC
AGC
CAT
TGC
AAA
ACT
TCA
CCC
TAA
AGA
AAC
GAG
TAT
CAG
TAG
GAT
GCA
GAC
GGT
TGA
TTG
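The two versions only differ when the length is not a multiple of 3. A minimal sketch with a made-up 8-character string (not the sample file) shows what happens to the leftover characters:

```python
from itertools import zip_longest

text = "AGCGTTGA"  # 8 characters, so 2 are left over after two triplets

chars = iter(text)
# zip stops as soon as the shared iterator runs dry and drops the partial group
strict = ["".join(group) for group in zip(chars, chars, chars)]

chars = iter(text)
# zip_longest pads the partial group with the fill value instead
padded = ["".join(group) for group in zip_longest(chars, chars, chars, fillvalue="")]

print(strict)  # ['AGC', 'GTT']
print(padded)  # ['AGC', 'GTT', 'GA']
```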
Padraic Cunningham
  • Why iteration over file object is smarter? – Basilevs Oct 04 '15 at 15:34
  • @Basilevs, because you are not keeping another copy of the file in memory for no reason. `"".join([line.rstrip() for line in f.readlines()])` will end up with three copies of the file in memory: the list comp, the join and the readlines call. – Padraic Cunningham Oct 04 '15 at 15:35

Just read the whole file and remove new lines:

with open('file') as f:
    text = f.read().replace('\n', '')
    triplets = [text[i:i+3] for i in range(0, len(text), 3)]

You could also avoid reading the whole file into memory and instead read from it iteratively while selecting triplets. You could even make this fully lazy by using generator functions and function composition, which gives the code a functional style:

def getCharacters (fileName):
    with open(fileName) as f:
        for line in f:
            yield from line.rstrip()

def getTriplets (source):
    it = [iter(source)] * 3
    for triplet in zip(*it):
        yield ''.join(triplet)

# and get a list of triplets
triplets = list(getTriplets(getCharacters('file')))
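The grouping in getTriplets relies on the list holding the same iterator three times, so each tuple that zip builds consumes three consecutive characters. A short sketch on a made-up string, compared with three independent iterators:

```python
source = "AGCGTT"

# One iterator object, referenced three times
it = [iter(source)] * 3
same_iterator = it[0] is it[1] is it[2]
grouped = list(zip(*it))

# Three independent iterators over the same string, for comparison
separate = [iter(source), iter(source), iter(source)]
separate_grouped = list(zip(*separate))

print(same_iterator)        # True
print(grouped)              # [('A', 'G', 'C'), ('G', 'T', 'T')]
print(separate_grouped[0])  # ('A', 'A', 'A')
```

With independent iterators every position advances in lockstep, so each tuple just repeats one character; sharing a single iterator is what turns zip into a chunker.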
poke
  • And of course, if you wish to decode those triplets to amino acids there's no need to build a list of them, you can do something like `for triplet in getTriplets(getCharacters('file')):` `aa = codon[triplet]`, where `codon` is a `dict` of amino acids indexed by triplet string. – PM 2Ring Oct 04 '15 at 15:23
  • Can you explain the "def getTriplets"? What does the "iter" and zip(*) do –  Oct 04 '15 at 16:20
  • @Donka That’s a [function definition](https://docs.python.org/3/tutorial/controlflow.html#defining-functions). [iter](https://docs.python.org/3/library/functions.html#iter) creates an iterator from an iterable, and [zip](https://docs.python.org/3/library/functions.html#zip) combines multiple iterables. Please read a tutorial if you really don’t know what functions are… – poke Oct 04 '15 at 16:50
  • Yeah, I did. Thank you a lot for answering. It helped me out a lot. –  Oct 04 '15 at 17:03
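The decoding idea from PM 2Ring's comment could look roughly like the sketch below. The codon dict here is deliberately partial and only for illustration (a full table has 64 entries), and the decode helper and its "?" placeholder for unknown triplets are assumptions, not part of any answer.

```python
# Partial DNA codon table, for illustration only: one-letter amino acid codes,
# with "*" marking a stop codon. A real table covers all 64 codons.
codon = {
    "ATG": "M",
    "GCA": "A",
    "GCC": "A",
    "TAA": "*",
    "TAG": "*",
    "TGA": "*",
}


def decode(triplets, table):
    """Lazily yield one symbol per triplet; "?" marks triplets missing from the table."""
    for triplet in triplets:
        yield table.get(triplet, "?")


print("".join(decode(["ATG", "GCA", "GCC", "TAA"], codon)))  # MAA*
```

Because decode accepts any iterable, it can be fed directly from a triplet generator without building an intermediate list.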

I don't know whether this solves the question, but please test my code.

I have only modified your code slightly.

As you mentioned in the comments, you want to remove the newlines in the middle of the file.

So instead of stripping, I replaced each '\n' with '', using

rtxt = raw_txt.replace('\n', '')

Here is the code:

fob = open("Exercise.txt", "r")
def read_from_file(filename): 
    raw_txt = filename.read()
    rtxt = raw_txt.replace('\n', '')
    triplets = [rtxt[i:i+3] for i in range(0, len(rtxt), 3)]
    print(triplets)  # parenthesized form works in both Python 2 and 3
read_from_file(fob)

The output in the triplets list:

['AGC', 'GTT', 'GAT', 'AGT', 'GCA', 'GCC', 'ATT', 'GCA', 'AAA', 'CTT', 'CAC', 'CCT', 'AAG', 'CGT', 'TGA', 'TAG', 'TGC', 'AGC', 'CAT', 'TGC', 'AAA', 'ACT', 'TCA', 'CCC', 'TAA', 'AGA', 'AAC', 'GAG', 'TAT', 'CAG', 'TAG', 'GAT', 'GCA', 'GAC', 'GGT', 'TGA', 'TTG']
Reck