5

I'm trying to read a CSV file using numpy.recfromcsv(...) where some of the fields contain commas. Those fields are surrounded by quotes, e.g. "value1, value2". NumPy sees the quoted field as two different fields, so the file is not parsed correctly. The command I'm using right now is:

    data = numpy.recfromcsv(dataFilename, delimiter=',', autostrip=True)

I found this question

Read CSV file with comma within fields in Python

But it doesn't use numpy, which I'd really love to use. So I'm hoping at least one of these options is possible:

  1. What options to numpy.recfromcsv(...) will allow me to read a quoted field as one field instead of multiple comma-separated fields?
  2. Should I format my CSV file differently?
  3. (Alternatively, but not ideally) Read the CSV as in the linked question, with extra steps to create a numpy array.

Please advise.

Henry Ecker
jlconlin
  • maybe [`pandas.read_csv`](http://pandas.pydata.org/pandas-docs/stable/io.html#csv-text-files) is an option – bmu Jan 18 '13 at 18:05
  • See this other question answered today http://stackoverflow.com/questions/14396362/how-can-i-efficiently-load-this-kind-of-ascii-files-with-python. The answer suggesting reading the whole file as a single row with '\n' as delimiter, and then defining a custom converter function that splits each line into its elements may be the way to go. – Jaime Jan 18 '13 at 19:04

3 Answers

2

It is possible to do this with pandas:

    import pandas
    np_array = pandas.read_csv("file_with_comma_fields_quoted.csv").to_numpy()  # as_matrix() was removed in pandas 1.0
random.me
1

If you are open to using Python's native csv reader, see the Python documentation for the csv module.

The csv reader takes an optional Dialect.quotechar setting, which defaults to '"'. In the CSV format, the quote character delimits a field as a whole, so the delimiter (a comma in your case) may appear inside a quoted field. The rules for the quoting character in the CSV format are laid out in the first section of this page.

So with the default quote character ", the native Python csv reader handles your case without any extra options.
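As a minimal sketch of that default behaviour, the snippet below feeds a few made-up rows (the column names and values are illustrative, not from the question's file) to csv.reader with no options set:

```python
import csv
import io

# Hypothetical sample data: the second field of each data row contains a comma.
sample = io.StringIO(
    'name,values,score\n'
    'a,"value1, value2",1.5\n'
    'b,"value3, value4",2.5\n'
)

# No dialect options given: delimiter=',' and quotechar='"' are the defaults.
rows = list(csv.reader(sample))

for row in rows:
    print(row)
```

Each row comes back with three fields, and the quoted field keeps its embedded comma.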

If you want to stick to Python, you could clean your CSV file first: use a regexp to identify the quoted fields and change the delimiter from a comma to \t, for instance. But at that point you are effectively parsing the CSV format yourself.
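One way to sketch the delimiter change without hand-written regular expressions is to let the csv module do both the reading and the rewriting. This swaps the regexp idea for csv.reader and csv.writer; the helper name commas_to_tabs and the sample text are made up for illustration.

```python
import csv
import io

def commas_to_tabs(text):
    """Re-emit comma-separated text as tab-separated text (hypothetical helper)."""
    src = io.StringIO(text, newline='')
    out = io.StringIO(newline='')
    # The writer only quotes a field if it contains a tab, a quote or a newline.
    writer = csv.writer(out, delimiter='\t', lineterminator='\n')
    writer.writerows(csv.reader(src))
    return out.getvalue()

# Illustrative input, not the question's actual file.
sample = 'name,values,score\na,"value1, value2",1.5\n'
converted = commas_to_tabs(sample)
print(converted)
```

The converted text no longer needs quotes around the comma-containing field, so a reader that only splits on a delimiter could be pointed at it with delimiter='\t'. This assumes no field contains a tab; such a field would be quoted again in the output.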

kiriloff
0

It turns out the easiest way to do this is to use the standard library module csv to read the file into a tuple, then use the tuple as input to a numpy array. I wish I could just read it in with numpy, but that doesn't seem to work.
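A rough sketch of that approach follows. The sample rows, the field names and the dtype are assumptions made for illustration; a real file would need its own dtype.

```python
import csv
import io

import numpy as np

# Hypothetical sample data standing in for the real file.
sample = io.StringIO(
    'name,values,score\n'
    'a,"value1, value2",1.5\n'
    'b,"value3, value4",2.5\n'
)

reader = csv.reader(sample)
header = next(reader)  # skip the header row

# One tuple per row; the numeric column is converted explicitly.
records = [(name, values, float(score)) for name, values, score in reader]

# Assumed dtype: two text fields and one float field.
dtype = [('name', 'U10'), ('values', 'U32'), ('score', 'f8')]
data = np.array(records, dtype=dtype)

# View as a record array to get attribute access, similar to recfromcsv output.
rec = data.view(np.recarray)
print(rec.values)
print(rec.score)
```

The rows must be tuples rather than lists for numpy to treat each one as a single record of the structured dtype.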

jlconlin