I am doing some bioinformatics research, and I'm new to Python. I wrote this code to interpret a file containing protein sequences. The file "bulk_sequences.txt" contains 71,423 lines. Each protein sequence takes up three lines, the first of which gives information including the year the protein was found (that's what the "/1945" stuff is all about). With a smaller sample of 1000 lines, it works just fine, but with the full file it seems to take an extremely long time. Is there something I can do to simplify this?
It is meant to go through the file, order the sequences by year of discovery, and store all three lines of each protein's data as one item in the list "sortedsqncs".
import time
start = time.time()
file = open("bulk_sequences.txt", "r")
fileread = file.read()
bulksqncs = fileread.split("\n")
year = 1933
newarray = []
years = []
thirties = ["/1933","/1934","/1935","/1936","/1937","/1938","/1939","/1940","/1941","/1942"]## years[0]
forties = ["/1943","/1944","/1945","/1946","/1947","/1948","/1949","/1950","/1951","/1952"]## years[1]
fifties = ["/1953","/1954","/1955","/1956","/1957","/1958","/1959","/1960","/1961","/1962"]## years[2]
sixties = ["/1963","/1964","/1965","/1966","/1967","/1968","/1969","/1970","/1971","/1972"]## years[3]
seventies = ["/1973","/1974","/1975","/1976","/1977","/1978","/1979","/1980","/1981","/1982"]## years[4]
eighties = ["/1983","/1984","/1985","/1986","/1987","/1988","/1989","/1990","/1991","/1992"]## years[5]
nineties = ["/1993","/1994","/1995","/1996","/1997","/1998","/1999","/2000","/2001","/2002"]## years[6]
twothsnds = ["/2003","/2004","/2005","/2006","/2007","/2008","/2009","/2010","/2011","/2012"]## years[7]
years = [thirties,forties,fifties,sixties,seventies,eighties,nineties,twothsnds]
count = 0
sortedsqncs = []
for x in range(len(years)):
    for i in range(len(years[x])):
        for y in bulksqncs:
            if years[x][i] in y:
                for n in range(len(bulksqncs)):
                    if y in bulksqncs[n]:
                        sortedsqncs.append(bulksqncs[n:n+3])
                        count += 1
print(len(sortedsqncs))
end = time.time()
print(round((end - start), 4))
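For comparison, the intended result described above (three lines per protein, ordered by year) might be reached in a single pass over the file instead of rescanning it for every year string. This is only a rough sketch under assumptions that the post does not confirm: that every record is exactly three consecutive lines, that the first line of each record is the one carrying the year, and that the first "/YYYY" pattern on that line is the discovery year. The names `group_and_sort` and `YEAR_PATTERN` are made up for illustration.

```python
import re

# Assumption: the year appears on the header line as "/YYYY", e.g. "/1945".
YEAR_PATTERN = re.compile(r"/(\d{4})")


def group_and_sort(lines, first_year=1933, last_year=2012):
    """Group lines into 3-line records and order the records by year."""
    records = []
    # Step through the lines three at a time; the first line of each group
    # is assumed to be the header that carries the year.
    for start in range(0, len(lines) - 2, 3):
        match = YEAR_PATTERN.search(lines[start])
        if match is None:
            continue  # header without a recognisable year: skip this record
        year = int(match.group(1))
        if first_year <= year <= last_year:
            records.append((year, lines[start:start + 3]))
    # list.sort is stable, so records from the same year keep their file order.
    records.sort(key=lambda record: record[0])
    return [record[1] for record in records]
```

It could be called as `sortedsqncs = group_and_sort(bulksqncs)`. Note that this is not a line-for-line equivalent of the loops above: it only looks at every third line, and it does not append a record more than once when two header lines contain the same text.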