
There are some similar questions to this, but nothing exact that I can find.

I have a very odd text-file with lines like the following:

field1=1; field2=2; field3=3; field1=4; field2=5; field3=6;

Matlab's textscan() function deals with this very neatly, as you can do this:

array = textscan(fid, 'field1=%d; field2=%d; field3=%d;');

and you will get back a cell-array where each column contains the respective field, and the text is simply ignored.

I'd like to rewrite the code that deals with this file in Python, but Numpy's loadtxt() and genfromtxt() don't seem to be able to ignore text interspersed with the numbers I want.

What are some Python ways to strip out the text and only get back the fields? I'm happy to use pandas or another library if required. Thanks!

EDIT: Another question was suggested as a duplicate, but it only gives equivalents to the basic usage of textscan and does not deal with unwanted text in the input. The answer below with fromregex is what I needed.

Tobias Wood
    Possible duplicate of [Python equivalent of Matlab textscan](https://stackoverflow.com/questions/13125447/python-equivalent-of-matlab-textscan) – grshankar Jul 17 '18 at 15:19
  • @grshankar: I would not consider this question a duplicate because those answers point to Numpy's `loadtxt()` and `genfromtxt()`, which don't fit the OP's needs because of the data structure to be handled. I just took the time to read through the docs for Matlab's [`textscan`](http://www.mathworks.com/help/matlab/ref/textscan.html) and I'm pretty sure there is no easy drop-in replacement. The best I could come up with was faking it with regex, which I've posted as an answer. – Steven Rumbalski Jul 17 '18 at 15:47
  • Leave open unless a better duplicate is found! The OP addresses the shortcomings of those answers in his question. – Steven Rumbalski Jul 17 '18 at 15:50

2 Answers


Numpy's fromregex function is basically the same as textscan. It lets you read a file based on a regular expression, with groups (the parts surrounded by ()) becoming the values. This works for your example:

import numpy as np

data = np.fromregex('temp.txt', r'field1=(\d+); field2=(\d+); field3=(\d+);', dtype='int')
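
For a file containing the sample line from the question, each repeat of the pattern becomes a row, much like textscan's columns. A rough check, assuming temp.txt holds that single line:

with open('temp.txt', 'w') as f:
    f.write('field1=1; field2=2; field3=3; field1=4; field2=5; field3=6;\n')

data = np.fromregex('temp.txt', r'field1=(\d+); field2=(\d+); field3=(\d+);', dtype='int')
print(data)
# [[1 2 3]
#  [4 5 6]]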

You can also use loadtxt. There is an argument, converters, that lets you provide functions to do the actual conversion from text to a number. You just need to provide functions that strip out the unneeded text.

So in my tests this works:

import numpy as np

myconv = lambda x: int(x.split(b'=')[-1])
mycols = [0, 1, 2]
convdict = {i: myconv for i in mycols}
data = np.loadtxt('temp.txt', delimiter=';', usecols=mycols, converters=convdict)

myconv is an anonymous function that takes a value (say 'field1=1'), splits it on the '=' symbol (making ['field1', '1']), takes the last piece ('1'), then converts that to a number (stored as the float 1.0, since loadtxt defaults to floats).

mycols is just the numbers of the columns you want to keep. Since there is a delimiter at the end of each line, it produces an empty trailing column, so we exclude that.

convdict is a dictionary where each key is a column number and each value is the function to convert that column to a number. In this case they are all the same, but you can customize them however you want.
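
As a rough check against the sample line (assuming the same temp.txt as above), this prints the first three values. Two caveats: usecols=[0, 1, 2] keeps only the first field1/field2/field3 group on each line, and newer NumPy versions pass str rather than bytes to converters, in which case the lambda should split on '=' instead of b'=':

import numpy as np

myconv = lambda x: int(x.split(b'=')[-1])  # use x.split('=') if your NumPy passes str
convdict = {i: myconv for i in [0, 1, 2]}
data = np.loadtxt('temp.txt', delimiter=';', usecols=[0, 1, 2], converters=convdict)
print(data)
# [1. 2. 3.]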

TheBlackCat

Python has no exact equivalent of Matlab's textscan (edit: but numpy has fromregex. See @TheBlackCat's answer for more.)

With more complicated formats, regular expressions may get the job done.

import re

# One capture group per field; the surrounding text is ignored.
line_pat = re.compile(r'field1=(\d+); field2=(\d+); field3=(\d+);')
with open(filepath, 'r') as f:
    array = [[int(n) for n in line_pat.match(line).groups()] for line in f]
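
One caveat: match only picks up the first occurrence of the pattern on each line, and the sample line repeats it. If every repeat should become a row, as textscan would produce, findall is a small change (a sketch under that assumption, with filepath as above):

import re

line_pat = re.compile(r'field1=(\d+); field2=(\d+); field3=(\d+);')
with open(filepath, 'r') as f:
    array = [[int(n) for n in groups]
             for line in f
             for groups in line_pat.findall(line)]
# For the sample line this gives [[1, 2, 3], [4, 5, 6]].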
Steven Rumbalski