
I am trying to take a CSV input file consisting of a single column of strings and convert it to an array in order to run some operations on each string separately. However, once I import the CSV, the resulting array is not structured as I had expected. Here is a code snippet that illustrates my problem:

import csv

regular_array = ["this", "is", "a", "test"]

# Import a csv with the same elements as regular_array in separate rows
# Put it in an array
csv_doc = csv.reader(open('tester.csv', 'rb'), delimiter=",", quotechar='|')

csv_array = []

for row in csv_doc:
    csv_array.append(row)


# Let's see if we can pass an element from these arrays into a function
print len(regular_array[0]), "\n", len(csv_array[0])

# Well that doesn't do what I thought it would
# Let's print out the arrays and see why.

print regular_array[0], "\n", csv_array[0]

# AHA! My arrays have different structures.

As you might expect, I get different results for the two operations because of the structure of the arrays. The elements of the first array are strings, so len(regular_array[0]) counts the letters in "this" and returns 4. The elements of the second array are lists, so len(csv_array[0]) counts the items in ['this'] and returns 1.
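To make that concrete, here is what the two structures look like in the interpreter (assuming tester.csv holds the same four words, one per row):

>>> regular_array[0]   # a string of four letters
'this'
>>> csv_array[0]       # a one-element list holding a string
['this']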

For my purposes, I need my arrays to be of the first kind.

My question has two parts: 1) Can someone point me to some resources to help me understand what phenomenon I am dealing with here? (I'm not entirely comfortable with the differences between the list/array/tuple constructs yet.)

2) Is there an approach I can use to convert my CSV input into the first kind of array, or is there a better way to go about storing the data once it is imported?

Thanks in advance.

acpigeon

3 Answers


This code produces a list of strings:

regular_array = ["this", "is", "a", "test"]

Every row of a csv file is also a list of strings. When you iterate over the rows and append them to csv_array -- itself a list -- you get a list of lists of strings. Like this:

csv_array = [['some', 'stuff'], ['some other', 'stuff']]

If you want to make a flat list, like regular_array, use extend instead of append.

>>> list_of_lists = [['some', 'stuff'], ['some other', 'stuff']]
>>> csv_array = []
>>> for l in list_of_lists:
...     csv_array.append(l)
... 
>>> csv_array
[['some', 'stuff'], ['some other', 'stuff']]
>>> csv_array = []
>>> for l in list_of_lists:
...     csv_array.extend(l)
... 
>>> csv_array
['some', 'stuff', 'some other', 'stuff']

You could also use +=; of the approaches here, += seems to be fastest by a hair, at least on my machine, while the append approach is much slower. Here are some timings. First, the definitions:

>>> import csv
>>> def gen_csv_file(size):
...     with open('test.csv', 'wb') as csv_f:
...         csv_w = csv.writer(csv_f)
...         csv_w.writerows([['item {0} row {1}'.format(i, j)
...                           for i in range(size)]
...                          for j in range(size)])
... 
>>> def read_append(csv_file):
...     csv_list = []
...     for row in csv_file:
...         for item in row:
...             csv_list.append(item)
...     return csv_list
... 
>>> def read_concat(csv_file):
...     csv_list = []
...     for row in csv_file:
...         csv_list += row
...     return csv_list
... 
>>> def read_extend(csv_file):
...     csv_list = []
...     for row in csv_file:
...         csv_list.extend(row)
...     return csv_list
... 
>>> def read_csv(read_func):
...     with open('test.csv', 'rb') as csv_f:
...         csv_r = csv.reader(csv_f)
...         return read_func(csv_r)
... 

Results:

read_append, file size: 10x10
10000 loops, best of 3: 59.4 us per loop
read_concat, file size: 10x10
10000 loops, best of 3: 47.8 us per loop
read_extend, file size: 10x10
10000 loops, best of 3: 48 us per loop
read_append, file size: 31x31
1000 loops, best of 3: 394 us per loop
read_concat, file size: 31x31
1000 loops, best of 3: 290 us per loop
read_extend, file size: 31x31
1000 loops, best of 3: 291 us per loop
read_append, file size: 100x100
100 loops, best of 3: 3.69 ms per loop
read_concat, file size: 100x100
100 loops, best of 3: 2.67 ms per loop
read_extend, file size: 100x100
100 loops, best of 3: 2.67 ms per loop
read_append, file size: 316x316
10 loops, best of 3: 40.1 ms per loop
read_concat, file size: 316x316
10 loops, best of 3: 29.9 ms per loop
read_extend, file size: 316x316
10 loops, best of 3: 30 ms per loop
read_append, file size: 1000x1000
1 loops, best of 3: 425 ms per loop
read_concat, file size: 1000x1000
1 loops, best of 3: 325 ms per loop
read_extend, file size: 1000x1000
1 loops, best of 3: 323 ms per loop

So using append is always slower, and using extend is almost the same as using +=.
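(The timing harness itself isn't shown above; a minimal sketch using the timeit module, with illustrative number/repeat values rather than the original setup, would look something like this:)

>>> import timeit
>>> for size in (10, 31, 100, 316, 1000):
...     gen_csv_file(size)
...     for func in (read_append, read_concat, read_extend):
...         # best of 3 repeats, averaged per call
...         best = min(timeit.repeat(lambda: read_csv(func),
...                                  number=10, repeat=3)) / 10
...         print '%s, file size: %dx%d: %.3g s per loop' % (
...             func.__name__, size, size, best)
... 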

senderle

csv.reader() returns each row as a list, so when you're running csv_array.append(row) on the first line in the csv file, you're adding the list ['this'] as the first element of csv_array. The first element of regular_array is a string, whereas the first element of csv_array is a list.

To add the 'cells' of each line in your csv file to csv_array individually, you could do something like this:

for row in csv_doc:
    for cell in row:
        csv_array.append(cell)
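Equivalently, the same flattening can be written as a single list comprehension (a small variation on the loop above):

csv_array = [cell for row in csv_doc for cell in row]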
Marius
  • @acpigeon, this is the slower option, I'm afraid. Both `extend` and `+=` are faster. See [my answer](http://stackoverflow.com/a/9476423/577088) for timings if you're interested. – senderle Feb 28 '12 at 15:57

Change the code in your for loop to:

for row in csv_doc:
    csv_array += row

See *Python append() vs. + operator on lists, why do these give different results?* for the difference between the + operator and append.
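A quick illustration of the difference (a minimal example, not taken from the linked question):

>>> a = [1, 2]
>>> a.append([3, 4])   # append adds the whole list as one element
>>> a
[1, 2, [3, 4]]
>>> b = [1, 2]
>>> b += [3, 4]        # += splices in the list's items
>>> b
[1, 2, 3, 4]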

caleech
  • To my surprise, `+=` is even a bit faster than `extend` -- though only by a bit. See [my answer](http://stackoverflow.com/a/9476423/577088) for timings if you're interested. – senderle Feb 28 '12 at 15:56