Sorry for the simple question, but I am new to Python and any help would be greatly appreciated! I am looking to import a txt file into a Python NumPy array. During this import I need to replace several strings using regular expressions (regex). The txt file has the following structure and is gigabytes in size, so performance is relatively important (low memory usage and as few passes over the data as possible):
Date, Time, Open, High
2019/7/21, 23:59:40, 13, 14
2019/8/2, 14:20:29, 14, 15
2019/8/2, 14:38:16, 15, 16
Below is the code I have. From what I've read, best practice is to read the file in line by line and apply the regular expressions during that process [1]. The second regular expression is commented out because I'm unsure how to apply multiple regular expressions. I have compiled the regular expressions as I understand this is more performant [2].
import re
from datetime import datetime

import numpy as np

regex1 = re.compile('Date, Time')
regex2 = re.compile(r',\s')

with open("Data.txt") as f_input:
    data = [regex1.sub('DateTime', line) for line in f_input]
    # data = [regex2.sub('', line, 1) for line in f_input]

parse_datetime = lambda x: np.datetime64(datetime.strptime(x.decode('utf-8'), '%Y/%m/%dT%H:%M:%S'))
array = np.genfromtxt(data, delimiter=", ", names=True,
                      converters={"DateTime": parse_datetime},
                      dtype=[('DateTime', 'datetime64[s]'), ('Open', 'i4'), ('High', 'i4')],
                      autostrip=True)
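For context, here is a self-contained sketch of the single-pass behaviour I'm aiming for. It is my own attempt, not necessarily the idiomatic approach: a generator applies regex1 to the header line only and regex2 to the data rows only, and sample lines stand in for Data.txt (np.genfromtxt accepts any iterable of lines, so a real file handle could be substituted). I also substitute a space rather than deleting the comma, so the strptime format uses ' ' instead of 'T'.

```python
import re
from datetime import datetime

import numpy as np

# Sample lines standing in for Data.txt.
lines = [
    "Date, Time, Open, High\n",
    "2019/7/21, 23:59:40, 13, 14\n",
    "2019/8/2, 14:20:29, 14, 15\n",
    "2019/8/2, 14:38:16, 15, 16\n",
]

regex1 = re.compile('Date, Time')  # header: merge the two column names
regex2 = re.compile(r',\s')        # data rows: join the date and time fields

def parse_datetime(x):
    # Converters may receive bytes or str depending on the numpy version
    # and encoding settings, so accept both.
    if isinstance(x, bytes):
        x = x.decode('utf-8')
    return np.datetime64(datetime.strptime(x, '%Y/%m/%d %H:%M:%S'))

def cleaned(f):
    # Single pass: regex1 only ever matches the header and regex2 is only
    # needed on the data rows, so handle the first line separately instead
    # of running both patterns over every line.
    it = iter(f)
    yield regex1.sub('DateTime', next(it))
    for line in it:
        # count=1 replaces only the first ", " (between date and time),
        # e.g. "2019/7/21 23:59:40, 13, 14"
        yield regex2.sub(' ', line, count=1)

array = np.genfromtxt(
    cleaned(lines),
    delimiter=", ",
    names=True,
    converters={"DateTime": parse_datetime},
    dtype=[('DateTime', 'datetime64[s]'), ('Open', 'i4'), ('High', 'i4')],
    autostrip=True,
)
```

Because `cleaned` is a generator, the substitutions happen lazily as np.genfromtxt consumes the lines, so no second multi-gigabyte list is built in memory.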
Thank you!