I have a large database stored as a text file (about 1 GB), and I am trying to join the data lines under each title into a single line so that I can run some data analysis on them. The format of the database is as follows:
>Title 1 Line
Data 1 Line
Data 1 Line
Data 1 line
>Title 2 Line
Data 2 Line
Data 2 Line
Data 2 Line ....
I want my output to be:
>Title 1 Line
Data 1 Line
>Title 2 Line
Data 2 Line
Here is my code for doing this:
# open the document
Data = open("Dataset.txt", "r")
# read all lines of the file into a list
protein = Data.readlines()
# string that accumulates the rewritten text
proteinfinal = ""
for line in protein:
    if ">" in line:
        proteinfinal += line
    else:
        proteinfinal += line.strip("\n")
# optional check of the result
#print(proteinfinal.strip())
# close the file
Data.close()
# reopen the same file for writing (this overwrites it)
Data = open("Dataset.txt", "w")
# write to file
Data.write(proteinfinal)
# close file
Data.close()
Is there any way to make this go faster? It has been running for a while on the full file, although the code works on smaller subsets of the dataset (10,000 lines) in a couple of minutes.
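One possible direction, as a minimal sketch rather than a definitive fix: the likely costs in the code above are holding the whole file in memory with readlines() and growing one very large string with +=. The sketch below instead reads the input one line at a time and writes each piece straight to a second file. The function name join_records and the separate output path are hypothetical. It also assumes that every title line starts with ">" and that data lines are concatenated with no separator, as in the original loop.

```python
def join_records(src_path, dst_path):
    """Stream src_path line by line, joining each record's data lines into one line.

    Assumptions (not from the original post): title lines start with ">",
    and the result goes to a separate file instead of overwriting the input.
    """
    with open(src_path, "r") as src, open(dst_path, "w") as dst:
        # True while a joined data line has been started but not yet terminated
        need_newline = False
        for line in src:  # iterating the file object reads one line at a time
            if line.startswith(">"):
                if need_newline:
                    # finish the previous record's joined data line
                    dst.write("\n")
                    need_newline = False
                dst.write(line.rstrip("\n") + "\n")
            else:
                text = line.rstrip("\n")
                if text:
                    # append this data line to the current joined line
                    dst.write(text)
                    need_newline = True
        if need_newline:
            # terminate the last record's joined data line
            dst.write("\n")


# Example call (file names are placeholders):
# join_records("Dataset.txt", "Dataset_joined.txt")
```

Writing to a separate file keeps the original dataset intact if the run is interrupted. Memory use stays roughly constant because only one line is held at a time.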