
Sorry for the simple question but I am new to Python and any help would be greatly appreciated! I am looking to import a txt file into a Python numpy array. During this import I need to replace several strings using regular expressions (regex). The txt files have the following structure and are gigabytes in size, so performance is relatively important (low memory usage and as few passes over the data as possible):

Date, Time, Open, High
2019/7/21, 23:59:40, 13, 14
2019/8/2, 14:20:29, 14, 15
2019/8/2, 14:38:16, 15, 16

Below is the code I have. From what I've read, best practice is to read the file line by line and apply the regular expressions during this process [1]. The second regular expression is commented out, as I'm unsure how to apply multiple regular expressions. I have compiled the regular expressions, as I understand this is more performant [2].

import re
from datetime import datetime

import numpy as np

# Merge the header's first two column names; match a comma followed by whitespace.
regex1 = re.compile(r'Date, Time')
regex2 = re.compile(r',\s')

with open("Data.txt") as f_input:
    data = [regex1.sub('DateTime', line) for line in f_input]
    # data = [regex2.sub('', line, 1) for line in f_input]

parse_datetime = lambda x: np.datetime64(
    datetime.strptime(x.decode('utf-8'), '%Y/%m/%dT%H:%M:%S'))
array = np.genfromtxt(
    data,
    delimiter=", ",
    names=True,
    converters={"DateTime": parse_datetime},
    dtype=[('DateTime', 'datetime64[s]'), ('Open', 'i4'), ('High', 'i4')],
    autostrip=True,
)
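In case it clarifies what I'm after, below is a sketch of how I imagine applying both patterns in a single pass. The clean_lines generator name is mine, and I'm assuming the header can be handled separately and that the first ", " in each data row should become a 'T' so the merged value matches my strptime format; I don't know if this is the idiomatic approach:

def clean_lines(lines):
    # Header: merge "Date, Time" into a single "DateTime" column name.
    yield regex1.sub('DateTime', next(lines))
    # Data rows: join date and time with a 'T' (first ", " only) so the
    # combined field matches the '%Y/%m/%dT%H:%M:%S' format above.
    for line in lines:
        yield regex2.sub('T', line, count=1)

with open("Data.txt") as f_input:
    array = np.genfromtxt(clean_lines(f_input), delimiter=", ", names=True,
                          converters={"DateTime": parse_datetime},
                          dtype=[('DateTime', 'datetime64[s]'),
                                 ('Open', 'i4'), ('High', 'i4')],
                          autostrip=True)

As I understand it, genfromtxt consumes the generator lazily, so the file would only be read once and never fully held in memory, but I'm not sure this is correct.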

Thank you!

  • Try the [pandas](https://pandas.pydata.org/) package. It is very good at reading this type of file. – David Hoffman Aug 09 '20 at 15:21
  • @DavidHoffman thank you. I’ve looked at pandas.read_csv, and although it does seem to be easier to use, my understanding is that numpy arrays are more performant for large files/arrays –  Aug 09 '20 at 15:24
  • What kind of 'performance' do you need? You are creating a structured array, with different dtype for each field. As for loading speed, which takes more time the `data` creation or `genfromtxt`? `pandas` `read_csv` can be quite a bit faster. Dataframes store their data in numpy arrays; extracting it with `values` is relatively fast. – hpaulj Aug 09 '20 at 15:32
  • @hpaulj the loading or data creation speed doesn’t bother me too much (genfromtxt seems to be quite slow), as opposed to what I do with the array afterwards - hence why I was looking to use numpy arrays. I was under the impression pandas arrays were different to numpy arrays, and from the benchmarks I saw slower to work with for large arrays. Are you suggesting I import the txt file into a pandas dataframe and then ‘export’ it to a numpy array (sorry if I’m using incorrect terms but this is new to me)? –  Aug 09 '20 at 15:39
  • I would give pandas another look. Its performance meets or exceeds numpy in many situations and it's built for financial analysis (though it's super useful in pretty much every other domain). Plus it's really good at handling date times. – David Hoffman Aug 09 '20 at 16:03
  • I forgot to mention that most of pandas is built on top of numpy so if you ever want an array you can get one easily with the `to_numpy` method. – David Hoffman Aug 09 '20 at 16:05
  • @DavidHoffman thank you! I’ll have another look at pandas. If I was to only use numpy did you have any suggestions for my original post? –  Aug 09 '20 at 16:06
  • Sadly I do not. For me pandas has already solved the problem and I’ve never had to use `genfromtxt` – David Hoffman Aug 09 '20 at 16:27
  • I don't think pandas overcomes the problems I am having, having read over the pandas documentation ([1](https://pandas.pydata.org/docs/user_guide/io.html#csv-text-files), [2](https://pandas.pydata.org/docs/user_guide/cookbook.html#cookbook-csv)). I am essentially trying to change the structure of the data as I read it in, with my example regular expressions (Regex 1 and 2) combining the first two columns. A sketch of the pandas route as I understand it is below. –  Aug 10 '20 at 06:35
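For reference, here is the pandas route the comments describe, as far as I follow it: a sketch that uses read_csv's parse_dates option to merge the Date and Time columns during the read (per the pandas I/O docs) and then converts the result to a structured numpy array. The variable names are mine, the merged column name 'Date_Time' is pandas' default for combined columns, and I haven't benchmarked this against genfromtxt:

import pandas as pd

# parse_dates=[['Date', 'Time']] merges the first two columns into a
# single datetime column (named 'Date_Time' by default);
# skipinitialspace handles the space after each comma.
df = pd.read_csv("Data.txt", sep=",", skipinitialspace=True,
                 parse_dates=[['Date', 'Time']])

# to_records gives a structured numpy array with one dtype per field;
# to_numpy would instead give a single object-dtype 2-D array here.
array = df.to_records(index=False)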

0 Answers