
I have a csv file where a line of data might look like this:

10,"Apple, Banana",20,...

When I load the data in Python, the extra comma inside the quotes shifts all my column indices around, so my data is no longer a consistent structure. While I could probably write a complex algorithm that iterates through each row and fixes the issue, I was hoping there was an elegant way to just pass an extra parameter to loadtxt (or some other function) that will properly ignore commas inside quotes and treat the entire quote as one value.

Note, when I load the CSV file into Excel, Excel correctly recognizes the string as one value.

Warren Weckesser
cas5nq

5 Answers


One way you could do it with a single numpy function call would be to use np.fromregex, which allows you to use Python's regular expression syntax to parse the contents of your text file in any arbitrary way. For example:

np.fromregex('tmp.csv', r'(\d+),"(.+)",(\d+)', dtype=object)

gives you:

array([['10', 'Apple, Banana', '20'],
       ['30', 'Orange, Watermelon', '40']], dtype=object)

To unpack that regular expression a bit, '(\d+)' will match one or more digits and '"(.+)"' will match one or more of any character inside double quotes. np.fromregex tries to match this expression against every line in your .csv file, and the parts inside the parentheses (the capture groups) become the individual elements in each row of the output array.
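To see what the pattern captures before numpy gets involved, you can run the same regex through the re module directly; this is a minimal sketch of what np.fromregex does internally (a findall over the file contents):

```python
import re

# The same pattern passed to np.fromregex, applied to one sample line.
pattern = re.compile(r'(\d+),"(.+)",(\d+)')
line = '10,"Apple, Banana",20'

# findall returns one tuple per match, one element per capture group.
print(pattern.findall(line))  # → [('10', 'Apple, Banana', '20')]
```

Each tuple of captured strings becomes one row of the resulting array.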

If you want a structured array as your output, with separate 'fields' for the three 'columns' in your .csv file, you can specify a dtype for each set of parentheses in the regex:

np.fromregex('tmp.csv', r'(\d+),"(.+)",(\d+)', 'i8, S20, i8')

gives you:

array([(10, 'Apple, Banana', 20), (30, 'Orange, Watermelon', 40)], 
      dtype=[('f0', '<i8'), ('f1', 'S20'), ('f2', '<i8')])
ali_m

This issue has been discussed before. There isn't a parameter in loadtxt (or genfromtxt) that does what you want. In other words, it is not quote sensitive. The python csv module has some sort of quote awareness. The pandas reader is also quote aware.

But processing the lines before passing them to loadtxt is perfectly acceptable. All the function needs is an iterable - something that can feed it lines one at a time. That can be a file, a list of lines, or a generator.

A simple processor would just replace the commas within quotes with some other character. Or replace the ones outside of quotes with a delimiter of your choice. It doesn't have to be fancy to do the job.

Using numpy.genfromtxt to read a csv file with strings containing commas

For example:

txt = """10,"Apple, Banana",20
30,"Pear, Orange",40
50,"Peach, Mango",60
"""

def foo(astr):
    # replace , outside quotes with ;
    # a bit crude and specialized
    x = astr.split('"')
    return ';'.join([i.strip(',') for i in x]) 

txt1 = [foo(astr) for astr in txt.splitlines()]
txtgen = (foo(astr) for astr in txt.splitlines())  # or as generator
# ['10;Apple, Banana;20', '30;Pear, Orange;40', '50;Peach, Mango;60']
np.genfromtxt(txtgen, delimiter=';', dtype=None)

produces:

array([(10, 'Apple, Banana', 20), (30, 'Pear, Orange', 40),
       (50, 'Peach, Mango', 60)], 
      dtype=[('f0', '<i4'), ('f1', 'S13'), ('f2', '<i4')])

I hadn't paid attention to np.fromregex before. Compared to genfromtxt it is surprisingly simple. To use with my sample txt I have to use a string buffer:

s=StringIO.StringIO(txt)
np.fromregex(s, r'(\d+),"(.+)",(\d+)', dtype='i4,S20,i4')

Its action distills down to:

pat = re.compile(r'(\d+),"(.+)",(\d+)')
dt = np.dtype('i4,S20,i4')
np.array(pat.findall(txt), dtype=dt)

It reads the whole file (f.read()) and does a findall which should produce a list like:

[('10', 'Apple, Banana', '20'),
 ('30', 'Pear, Orange', '40'),
 ('50', 'Peach, Mango', '60')]

A list of tuples is exactly what a structured array requires.

No fancy processing, error checks or filtering of comment lines. Just a pattern match followed by array construction.


Both my foo and fromregex assume a specific sequence of numbers and quoted strings. The csv.reader might be the simplest general purpose quote reader. The join is required because reader produces a list of lists, while genfromtxt wants an iterable of strings (it does its own 'split').

from csv import reader
s=StringIO.StringIO(txt)
np.genfromtxt((';'.join(x) for x in reader(s)), delimiter=';', dtype=None)

producing

array([(10, 'Apple, Banana', 20), (30, 'Pear, Orange', 40),
       (50, 'Peach, Mango', 60)], 
      dtype=[('f0', '<i4'), ('f1', 'S13'), ('f2', '<i4')])

Or in following the fromregex example, the reader output could be turned into a list of tuples and given to np.array directly:

np.array([tuple(x) for x in reader(s)], dtype='i4,S20,i4')
hpaulj

I solved this with the code below.

def transformCommas(line):
    out = ''
    insideQuote = False
    for c in line:
        if c == '"':
            insideQuote = not insideQuote
        if insideQuote and c == ',':
            out += '.'
        else:
            out += c
    return out

f = open("data/raw_data_all.csv", "rb")
replaced = (transformCommas(line) for line in f)
rawData = numpy.loadtxt(replaced,delimiter=',', skiprows=0, dtype=str)

Data:

1366x768,18,"5,237",73.38%,"3,843",79.55%,1.75,00:01:26,4.09%,214,$0.00
1366x768,22,"5,088",76.04%,"3,869",78.46%,1.82,00:01:20,3.93%,200,$0.00
1366x768,17,"4,887",74.34%,"3,633",78.37%,1.81,00:01:19,3.25%,159,$0.00
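Traced by hand on the first data line, the replacement behaves as follows (a restatement of transformCommas above; note that it turns the thousands separator in "5,237" into a decimal point, which may or may not be what you want downstream):

```python
# Restatement of the transformCommas idea, applied to one sample line.
def transform_commas(line):
    out = ''
    inside_quote = False
    for c in line:
        if c == '"':
            inside_quote = not inside_quote  # toggle on every quote character
        if inside_quote and c == ',':
            out += '.'  # comma inside quotes becomes a period
        else:
            out += c
    return out

print(transform_commas('1366x768,18,"5,237",73.38%'))
# → 1366x768,18,"5.237",73.38%
```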

You can use the Python csv module: https://docs.python.org/2/library/csv.html

Given a data file in csv format:

10,"Apple,Banana",20
20,"Orange,Watermelon",30

With this script:

from csv import reader

with open('data.csv') as f:
    for row in reader(f):
        print row

You get:

['10', 'Apple,Banana', '20']
['20', 'Orange,Watermelon', '30']

Since loadtxt requires an iterable, pass it reader(f):

with open('data.csv') as f:
    data = loadtxt(reader(f), ...)
miraculixx
Juan Fco. Roco
  • `loadtxt(reader(f)...)` doesn't work because `loadtxt` wants an iterable that returns strings. `reader()` produces a list of lists. You need to `join` them: `(';'.join(x) for x in reader(f))`. – hpaulj Feb 12 '15 at 17:16
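Following that comment, a corrected version of the snippet might look like this (a sketch using genfromtxt and an in-memory buffer in place of the file, so the rows are rejoined into strings before numpy sees them):

```python
import io
import csv
import numpy as np

# csv.reader handles the quoting; joining each parsed row back into a
# single string gives numpy the iterable of lines it expects.
data = '10,"Apple,Banana",20\n20,"Orange,Watermelon",30\n'
rows = (';'.join(row) for row in csv.reader(io.StringIO(data)))
print(np.genfromtxt(rows, delimiter=';', dtype=None, encoding=None))
```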

While there is no such parameter in numpy.loadtxt to ignore quoted or otherwise escaped commas, one alternative that has not been suggested yet would be the following...

Perform a find and replace using some text editor to replace commas with tabs OR save the file in Excel as tab delimited.

When you use numpy.loadtxt, simply specify delimiter='\t' instead of ','.

Simple solution that could save you some code...
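As a sketch of that idea, once the field separator is a tab the commas inside fields are just ordinary characters (the sample data here is a tab-delimited version of the question's example):

```python
import io
import numpy as np

# Tab-delimited data: commas inside fields no longer clash with the delimiter.
data = '10\tApple, Banana\t20\n30\tOrange, Watermelon\t40\n'
arr = np.loadtxt(io.StringIO(data), delimiter='\t', dtype=str)
print(arr)  # 2x3 array of strings; 'Apple, Banana' stays one field
```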

Chris Collett