After searching for a while, I have found many related questions/answers to this problem but nothing that really addresses what I am looking for. Basically, I am implementing code in python to be able to query information from a star catalogue (in particular the tycho 2 star catalogue).
This data is stored in a largish (~0.5 gigabyte) text file where each row corresponds to an entry for a star.
A few example rows are
0001 00008 1| | 2.31750494| 2.23184345| -16.3| -9.0| 68| 73| 1.7| 1.8|1958.89|1951.94| 4|1.0|1.0|0.9|1.0|12.146|0.158|12.146|0.223|999| | | 2.31754222| 2.23186444|1.67|1.54| 88.0|100.8| |-0.2
0001 00013 1| | 1.12558209| 2.26739400| 27.7| -0.5| 9| 12| 1.2| 1.2|1990.76|1989.25| 8|1.0|0.8|1.0|0.7|10.488|0.038| 8.670|0.015|999|T| | 1.12551889| 2.26739556|1.81|1.52| 9.3| 12.7| |-0.2
0001 00016 1| | 1.05686490| 1.89782870| -25.9| -44.4| 85| 99| 2.1| 2.4|1959.29|1945.16| 3|0.4|0.5|0.4|0.5|12.921|0.335|12.100|0.243|999| | | 1.05692417| 1.89793306|1.81|1.54|108.5|150.2| |-0.1
0001 00017 1|P| 0.05059802| 1.77144349| 32.1| -20.7| 21| 31| 1.6| 1.6|1989.29|1985.38| 5|1.4|0.6|1.4|0.6|11.318|0.070|10.521|0.051| 18|T| | 0.05086583| 1.77151389|1.78|1.55| 30.0| 45.6|D|-0.2
The information is both delimited and fixed width. Each column contains a different piece of information about the star. Now, for my python utility I would like to be able to quickly search through this information and retrieve the entries for stars that match a set of criteria specified by the user.
For instance, I would like to be able to find all stars with magnitude brighter than 5.5 (col 18 or 19) that have a right ascension between 0 and 30 degrees (col 3) and a declination between -45 and -35 degrees (col 4) efficiently. Now, if I could store all this information in memory it would be easy to read the file into a numpy structured array or pandas dataframe and retrieve the stars I want using logical indexing. Unfortunately, the machine I am working on does not have enough memory to do this (I only ever have about 0.5 gigbytes of memory free at any given time and the rest of the program I am using takes up a good chunk of memory).
My current solution involves walking through each line of the text file, interpreting the data, and storing the entry in memory only if it matches the criteria that was specified. The method I have to do this is
def getallwithcriteria(self, min_vmag=1., max_vmag=17., min_bmag=1., max_bmag=17., min_ra=0., max_ra=360.,
min_dc=-90., max_dc=90., min_prox=3, search_center=None, search_radius=None):
"""
This method returns entire star records for each star that meets the specified criterion. The defaults for each
criteria specify the entire range of the catalogue. Do not call this without changing the defaults as this will
likely overflow memory and cause your system to drastically slow down or crash!
Note that all of the keyword argument do not need to be specified. For instance, we could run
import tychopy as tp
tyc = tp.Tycho('/path/to/catalogue')
star_records = tyc.getallwithcritera(min_vmag=3,max_vmag=4)
to return all stars that have a visual magnitude between 3 and 4.
This method returns a numpy structured array where each element contains the complete record for a star that
matches the criterion specified by the user. The output array has the following dtype:
[('starid', 'U12'),
('pflag', 'U1'),
('starBearing', [('rightAscension', float), ('declination', float)]),
('properMotion', [('rightAscension', float), ('declination', float)]),
('uncertainty', [('rightAscension', int), ('declination', int), ('pmRA', float), ('pmDc', float)]),
('meanEpoch', [('rightAscension', float), ('declination', float)]),
('numPos', int),
('fitGoodness', [('rightAscension', float), ('declination', float), ('pmRA', float), ('pmDc', float)]),
('magnitude', [('BT', [('mag', float), ('err', float)]), ('VT', [('mag', float), ('err', float)])]),
('starProximity', int),
('tycho1flag', 'U1'),
('hipparcosNumber', 'U9'),
('observedPos', [('rightAscension', float), ('declination', float)]),
('observedEpoch', [('rightAscension', float), ('declination', float)]),
('observedError', [('rightAscension', float), ('declination', float)]),
('solutionType', 'U1'),
('correlation', float)]
see the readme of the Tycho 2 catalogue for a more formal description of each field.
If no stars are found that match the specified input then an empty numpy array with the above dtype is returned.
Note that both a rectangular and a circular area can be specified. The rectangular search area is specified
using the min_ra/dc max_ra/dc keyword arguments while the circular search area is specified using the
search_center and search_radius keyword arguments where the search_center is a tuple, list, numpy array, or
other array like object which contains the center right ascension in element 0 and the center declination in
element 1. It is not recommended to specify both the circular and rectangular search areas. If the search
areas do not overlap then no stars will be returned.
:param min_vmag: the minimum (brightest) visual magnitude to return
:param max_vmag: the maximum (dimmest) visual magnitude to return
:param min_bmag: the minimum (brightest) blue magnitude to return
:param max_bmag: the maximum (dimmest) blue magnitude to return
:param min_ra: the minimum right ascension to return
:param max_ra: the maximum right ascension to return
:param min_dc: the minimum declination to return
:param max_dc: the maximum declination to return
:param min_prox: the closest proximity to a star to return
:param search_center: An array like object containing the center point from which to search radially for stars.
:param search_radius: A float specifying the radial search distance to use
:return: A numpy structure array containing the star records for stars that meet the specified criteria
"""
# form the dtype list that genfromtxt will use to interpret the star records
dform = [('starid', 'U12'),
('pflag', 'U1'),
('starBearing', [('rightAscension', float), ('declination', float)]),
('properMotion', [('rightAscension', float), ('declination', float)]),
('uncertainty', [('rightAscension', int), ('declination', int), ('pmRA', float), ('pmDc', float)]),
('meanEpoch', [('rightAscension', float), ('declination', float)]),
('numPos', int),
('fitGoodness', [('rightAscension', float), ('declination', float), ('pmRA', float), ('pmDc', float)]),
('magnitude', [('BT', [('mag', float), ('err', float)]), ('VT', [('mag', float), ('err', float)])]),
('starProximity', int),
('tycho1flag', 'U1'),
('hipparcosNumber', 'U9'),
('observedPos', [('rightAscension', float), ('declination', float)]),
('observedEpoch', [('rightAscension', float), ('declination', float)]),
('observedError', [('rightAscension', float), ('declination', float)]),
('solutionType', 'U1'),
('correlation', float)]
# initialize a list which will contain the star record strings for stars that match the input criteria
records = []
# loop through each record in the Tycho2 catlogue
for record in self._catalogueFile:
# interpret the record as simply as we can
split_record = record.split(sep="|")
# check that we are examining a good star, that it falls within the bearing bounds, and that it is far
# enough away from other stars
if ("X" not in split_record[1]) and min_ra <= float(split_record[2]) <= max_ra \
and min_dc <= float(split_record[3]) <= max_dc and int(split_record[21]) >= min_prox:
# perform the radial search if the user has specified a center and radius
if search_center is None or pow(pow(float(split_record[2])-search_center[0], 2) +
pow(float(split_record[3])-search_center[1], 2), 1/2.) < search_radius:
# Check to see if we have values for both blue and visual magnitudes, and check to see if these
# magnitudes fall within the specified magnitude bounds
# We need to split this up like this in order to make sure that either the bmag or the vmag exist
if bool(split_record[17].strip()) and bool(split_record[19].strip()) \
and min_bmag <= float(split_record[17]) <= max_bmag \
and min_vmag <= float(split_record[19]) <= max_vmag:
records.append(record+'\n')
# if only the visual magnitude exists then check its bounds - also check if the user has specified
# its bounds
elif not bool(split_record[17].strip()) and bool(split_record[19].strip()) \
and min_vmag <= float(split_record[19]) <= max_vmag and (max_vmag != 17. or min_vmag != 1.):
records.append(record+'\n')
# if only the blue magnitude exists the check its bounds - also check if the user has specified its
# bounds
elif not bool(split_record[19].strip()) and bool(split_record[17].strip()) \
and min_bmag <= float(split_record[17]) <= max_bmag and (max_bmag != 17. or min_bmag != 1.):
records.append(record+'\n')
# otherwise check to see if the use has changed the defaults. If they haven't then store the star
elif max_bmag == 17. and max_vmag == 17. and min_bmag == 1. and min_vmag == 1.:
records.append(record+'\n')
# check to see if any stars met the criteria. If they didn't then return an empty array. If they did then use
# genfromtxt to interpret the string of star records
if not bool(records):
nprecords = np.empty((1,), dtype=dform)
warnings.warn('No stars were found meeting your criteria. Please try again.')
else:
nprecords = np.genfromtxt(BytesIO("".join(records).encode()), dtype=dform, delimiter='|', converters={
0: lambda s: s.strip(),
1: lambda s: s.strip(),
22: lambda s: s.strip(),
23: lambda s: s.strip(),
30: lambda s: s.strip()})
if self._includeProperMotion:
applypropermotion(nprecords, self.newEpoch, copy=False)
# reset the catalogue back to the beginning for future searches
self._catalogueFile.seek(0, os.SEEK_SET)
return nprecords
This is still very slow (although faster than using up all of the memory and pushing everything else into swap). For comparison, this takes about 2-3 minutes each time I need to retrieve stars and I need to retrieve stars from this 40 or so times (with different criteria each time) in the program I am writing this for. The rest of the program takes about 5 seconds total.
My question is now, what is the best way to speed up this process (outside of getting a better computer with more memory). I am open to any suggestions as long as they are explained well and won't take me months to implement. I am even willing to write a function that goes through and modifies the original catalogue file into a better format (fixed width binary file sorted by a specific column) in order to speed things up.
So far I have considered memmap'ing the file but decided against it because I didn't really think it would help with what I need to do. I have also considered creating a database from the data and then using sqlalchemy or something similar to query the data that way; however, I am not super familiar with databases and don't know if that would offer any real speed improvement.