
I have a large database as a text file (about 1 GB) and I am trying to collapse each record's data lines into a single line so that I can do some data analysis on those lines. The format of the database is as follows:

>Title 1 Line
Data 1 Line
Data 1 Line
Data 1 Line
>Title 2 Line
Data 2 Line
Data 2 Line
Data 2 Line ....

I want my output to be:

>Title 1 Line
Data 1 Line
>Title 2 Line
Data 2 Line

Here is my code for doing this:

# open the dataset
Data = open("Dataset.txt", "r")

# read every line into a list
protein = Data.readlines()

# accumulator for the rewritten text
proteinfinal = ""

for line in protein:
    if ">" in line:
        # title lines keep their newline
        proteinfinal += line
    else:
        # data lines lose their newline so they join into one line
        proteinfinal += line.strip("\n")

# print(proteinfinal.strip())

# close the file
Data.close()

# reopen the same file for writing
Data = open("Dataset.txt", "w")
# write the collapsed text
Data.write(proteinfinal)
# close the file
Data.close()

Is there any way to make this go faster? It has been running for a while now, and the code works on smaller subsets of the dataset (10,000 lines) in a couple of minutes.

Meadowlion
  • Yeah, don't use `protein = Data.readlines()`, just iterate over the file object directly, `for line in Data: ...`. Then, **don't use concatenation to grow a string**, i.e. don't use `proteinfinal += line`; use a list, `proteinlist = []`, **append to that list in the loop**, and finally, at the end of your loop, use `proteinfinal = ''.join(proteinlist)` – juanpa.arrivillaga Jan 14 '20 at 18:49
  • What Python implementation and version are you using? – Stefan Pochmann Jan 14 '20 at 19:09

2 Answers


Yes, don't use `readlines`, iterate over the file object directly. More importantly, don't use `+=` to grow a string in a loop; that is giving you quadratic behavior. Try the following:

protein_parts = []
with open("Dataset.txt", "r") as f:
    for line in f:
        if ">" in line:
            protein_parts.append(line)
        else:
            protein_parts.append(line.strip("\n"))
proteinfinal = ''.join(protein_parts)

Note, in this particular case, the fastest thing you can probably do is something like:

with open("Dataset.txt", "r") as f_in, open("Dataset0.txt", "w") as f_out:
    for line in f_in:
        if ">" in line:
            f_out.write(line)
        else:
            f_out.write(line.strip("\n"))

Now you have two files, but if you must keep the old name, just do something like:

import os
os.remove("Dataset.txt")
os.rename(""Dataset0.txt", "Dataset.txt")
juanpa.arrivillaga
  • For "quadratic behavior" it's amazingly fast. A million `+=` in 0.75 seconds: https://repl.it/repls/UnevenSpecializedCad – Stefan Pochmann Jan 14 '20 at 19:06
  • @StefanPochmann it may not actually be quadratic. Depending on the interpreter version, this actually becomes a linear-time algorithm, but it's usually not a good idea to rely on that behavior because it can be fooled easily. EDIT: so check out: https://stackoverflow.com/questions/44487537/why-does-naive-string-concatenation-become-quadratic-above-a-certain-length – juanpa.arrivillaga Jan 14 '20 at 19:15
  • Yeah, I know. I just don't like the assertion that it *is* quadratic when it only *can* be quadratic. Though in this case you might be right, as I don't see any other reason why their 10,000 lines would take minutes. That's why I asked them about their Python. That said, I find it even more likely that the real reason is that they're not showing us the real code but a reduced version. – Stefan Pochmann Jan 14 '20 at 19:20
  • Oh, just saw your edit. I did not know *that* yet. Reading now... – Stefan Pochmann Jan 14 '20 at 19:24
  • 1
    Lol. That question you linked to says *"My question is based made on [this comment](https://stackoverflow.com/questions/4435169/good-way-to-append-to-a-string#comment65441380_4435752)"*. And that comment is... **mine** :-D – Stefan Pochmann Jan 14 '20 at 19:32
  • Yes, this is just a snippet of larger code that actually makes the dataset, and I used the same syntax for all of it, so there are 40+ string additions. I wasn't aware that it was that taxing. – Meadowlion Jan 14 '20 at 20:03

You could try splitting the file up into smaller files using filesplit and then using the `multiprocessing` module to do the work concurrently.
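
A minimal sketch of that idea, assuming the file has already been split on line boundaries (the chunk file names, the `worker` function, and the output name `Dataset_joined.txt` here are illustrative, not part of filesplit). Since every line is handled independently, processing chunks separately and concatenating the results gives the same output as the single-process version:

from multiprocessing import Pool

def worker(path):
    # same per-line rule as the answer above: title lines keep
    # their newline, data lines lose it
    parts = []
    with open(path) as f:
        for line in f:
            if ">" in line:
                parts.append(line)
            else:
                parts.append(line.strip("\n"))
    out_path = path + ".out"
    with open(out_path, "w") as f:
        f.write("".join(parts))
    return out_path

if __name__ == "__main__":
    # hypothetical chunk files produced by splitting Dataset.txt on line boundaries
    chunks = ["chunk0.txt", "chunk1.txt", "chunk2.txt"]
    with Pool() as pool:
        out_paths = pool.map(worker, chunks)
    # stitch the processed chunks back together in order
    with open("Dataset_joined.txt", "w") as out:
        for p in out_paths:
            with open(p) as f:
                out.write(f.read())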

TomG12