This issue has been discussed before. Neither loadtxt nor genfromtxt has a parameter that does what you want; in other words, they are not quote sensitive. The Python csv module does have some quote awareness, and the pandas reader is also quote aware.
But processing the lines before passing them to loadtxt is quite acceptable. All the function needs is an iterable, something that can feed it lines one at a time. So that can be a file, a list of lines, or a generator.
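To illustrate the point about iterables, here is a minimal sketch (the sample rows are made up for the illustration) that feeds the same lines to genfromtxt once as a list and once as a generator:

```python
import numpy as np

lines = ["1,2", "3,4"]  # made-up sample rows, not from the question

# a list of strings is accepted directly
from_list = np.genfromtxt(lines, delimiter=',')

# so is a generator that yields one line at a time
from_gen = np.genfromtxt((ln for ln in lines), delimiter=',')
```

Both calls should give the same 2x2 float array, which is why a small line-by-line preprocessor can sit in front of the reader.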
A simple processor would just replace the commas within quotes with some other character. Or replace the ones outside of quotes with a delimiter of your choice. It doesn't have to be fancy to do the job.
An earlier question on the same problem: "Using numpy.genfromtxt to read a csv file with strings containing commas".
For example:
import numpy as np

txt = """10,"Apple, Banana",20
30,"Pear, Orange",40
50,"Peach, Mango",60
"""

def foo(astr):
    # replace , outside quotes with ;
    # a bit crude and specialized
    x = astr.split('"')
    return ';'.join([i.strip(',') for i in x])

txt1 = [foo(astr) for astr in txt.splitlines()]
# ['10;Apple, Banana;20', '30;Pear, Orange;40', '50;Peach, Mango;60']
txtgen = (foo(astr) for astr in txt.splitlines())  # or as generator
np.genfromtxt(txtgen, delimiter=';', dtype=None)
produces:
array([(10, 'Apple, Banana', 20), (30, 'Pear, Orange', 40),
(50, 'Peach, Mango', 60)],
dtype=[('f0', '<i4'), ('f1', 'S13'), ('f2', '<i4')])
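The other option mentioned earlier, replacing the commas inside the quotes instead, might look like the sketch below. The helper name mask_quoted_commas, the '|' replacement character and the two-row sample are my own choices for illustration, and it assumes the quoted fields contain no escaped quotes:

```python
import re

def mask_quoted_commas(line, repl='|'):
    """Replace commas inside double-quoted fields with `repl` and drop the quotes."""
    # Each "..." run is rewritten on its own; text outside quotes is untouched.
    return re.sub(r'"([^"]*)"',
                  lambda m: m.group(1).replace(',', repl),
                  line)

sample = '''10,"Apple, Banana",20
30,"Pear, Orange",40
'''
masked = [mask_quoted_commas(line) for line in sample.splitlines()]
# masked can then be handed to np.genfromtxt(masked, delimiter=',', dtype=None)
```

The replacement character has to be something that never occurs in the data, and it would need to be swapped back to a comma after loading if the original text matters.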
I hadn't paid attention to np.fromregex before. Compared to genfromtxt it is surprisingly simple. To use it with my sample txt I have to use a string buffer:
import io  # Python 3; in Python 2 this was StringIO.StringIO
s = io.StringIO(txt)
np.fromregex(s, r'(\d+),"(.+)",(\d+)', dtype='i4,S20,i4')
Its action distills down to:
import re
pat = re.compile(r'(\d+),"(.+)",(\d+)')
dt = np.dtype('i4,S20,i4')
np.array(pat.findall(txt), dtype=dt)
It reads the whole file (f.read()) and does a findall, which should produce a list like:
[('10', 'Apple, Banana', '20'),
('30', 'Pear, Orange', '40'),
('50', 'Peach, Mango', '60')]
A list of tuples is exactly what a structured array requires.
No fancy processing, error checks or filtering of comment lines. Just a pattern match followed by array construction.
Both my foo and fromregex assume a specific sequence of numbers and quoted strings. The csv.reader might be the simplest general-purpose quote reader. In the code that follows, the join is required because reader yields each row as a list of fields, while genfromtxt wants an iterable of strings (it does its own 'split').
from csv import reader
import io  # Python 3; in Python 2 this was StringIO.StringIO
s = io.StringIO(txt)
np.genfromtxt((';'.join(x) for x in reader(s)), delimiter=';', dtype=None)
producing
array([(10, 'Apple, Banana', 20), (30, 'Pear, Orange', 40),
(50, 'Peach, Mango', 60)],
dtype=[('f0', '<i4'), ('f1', 'S13'), ('f2', '<i4')])
Or, following the fromregex example, the reader output could be turned into a list of tuples and given to np.array directly:
s.seek(0)  # rewind: the reader above already consumed s
np.array([tuple(x) for x in reader(s)], dtype='i4,S20,i4')
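Wrapped up as a small function, that last approach could look roughly like this. It is only a sketch: read_quoted_csv is a hypothetical helper name, the two-row sample is made up, and I use a unicode field ('U20') rather than 'S20' on the assumption that this runs under Python 3:

```python
import csv
import io
import numpy as np

def read_quoted_csv(lines, dtype):
    """Parse quote-aware CSV lines into a structured array (hypothetical helper)."""
    # csv.reader handles the quoting; blank lines come back as [] and are skipped
    rows = [tuple(row) for row in csv.reader(lines) if row]
    # a list of tuples is what a structured dtype expects
    return np.array(rows, dtype=dtype)

sample = io.StringIO('10,"Apple, Banana",20\n30,"Pear, Orange",40\n')
arr = read_quoted_csv(sample, dtype='i4,U20,i4')
```

Unlike genfromtxt with dtype=None, this needs the dtype spelled out, and the string field width (20 here) has to be large enough for the longest value or it will be truncated.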