
I created a numpy array from csv by

dtest = np.genfromtxt('data/test.csv', delimiter=",", names=True)

The data has 200 columns named 'name', 'id', and so on. I'm trying to delete the 'id' column.

Can I do that using the name of the column?

ali_m
Zoe
  • Have you had a look at this http://stackoverflow.com/questions/1642730/how-to-delete-columns-in-numpy-array? – Chuck Feb 14 '17 at 20:51

3 Answers


The answers in the proposed duplicate, How do you remove a column from a structured numpy array? show how to reference a subset of the fields of a structured array. That may be what you want, but it has a potential problem, which I'll illustrate in a bit.

Start with a small sample csv 'file':

In [32]: txt=b"""a,id,b,c,d,e
    ...: a1, 3, 0,0,0,0.1
    ...: b2, 4, 1,2,3,4.4
    ...: """
In [33]: data=np.genfromtxt(txt.splitlines(), delimiter=',',names=True, dtype=None)
In [34]: data
Out[34]: 
array([(b'a1', 3, 0, 0, 0,  0.1), 
       (b'b2', 4, 1, 2, 3,  4.4)], 
      dtype=[('a', 'S2'), ('id', '<i4'), ('b', '<i4'), ('c', '<i4'), ('d', '<i4'), ('e', '<f8')])

Multifield selection

I can get a 'view' of a subset of the fields with a field name list. The 'duplicate' showed how to construct such a list from the data.dtype.names. Here I'll just type it in, omitting the 'id' name.

In [35]: subd=data[['a','b','c','d']]
In [36]: subd
Out[36]: 
array([(b'a1', 0, 0, 0), (b'b2', 1, 2, 3)], 
      dtype=[('a', 'S2'), ('b', '<i4'), ('c', '<i4'), ('d', '<i4')])

The problem is that this isn't a regular 'view'. It's fine for reading, but any attempt to write to the subset raises a warning:

In [37]: subd[0]['b'] = 3
/usr/local/bin/ipython3:1: FutureWarning: Numpy has detected that you (may be) writing to an array returned
by numpy.diagonal or by selecting multiple fields in a structured
array. This code will likely break in a future numpy release --
see numpy.diagonal or arrays.indexing reference docs for details.
The quick fix is to make an explicit copy (e.g., do
arr.diagonal().copy() or arr[['f0','f1']].copy()).
  #!/usr/bin/python3

Making a copy of the subset is OK, but changes to subd won't affect data.

In [38]: subd=data[['a','b','c','d']].copy()
In [39]: subd[0]['b'] = 3
In [40]: subd
Out[40]: 
array([(b'a1', 3, 0, 0), (b'b2', 1, 2, 3)], 
      dtype=[('a', 'S2'), ('b', '<i4'), ('c', '<i4'), ('d', '<i4')])

A simple way to delete the ith field name from the indexing list:

In [60]: subnames = list(data.dtype.names)   # list so it's mutable
In [61]: subnames
Out[61]: ['a', 'id', 'b', 'c', 'd', 'e']
In [62]: del subnames[1]
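With the 'id' name removed, subnames can be used directly as a multifield index. A minimal self-contained sketch, using a small structured array built to match the example above:

```python
import numpy as np

# small structured array matching the genfromtxt example above
data = np.array([(b'a1', 3, 0, 0, 0, 0.1), (b'b2', 4, 1, 2, 3, 4.4)],
                dtype=[('a', 'S2'), ('id', '<i4'), ('b', '<i4'),
                       ('c', '<i4'), ('d', '<i4'), ('e', '<f8')])

subnames = list(data.dtype.names)   # list so it's mutable
subnames.remove('id')               # drop the unwanted field by name
subd = data[subnames].copy()        # explicit copy avoids the write warning
print(subd.dtype.names)             # ('a', 'b', 'c', 'd', 'e')
```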

usecols

Since you are reading this array from the csv, you could use usecols to load everything but the 'id' column.

Since you have a large number of columns, it would be easiest to do something like:

In [42]: col=list(range(6)); del col[1]
In [43]: col
Out[43]: [0, 2, 3, 4, 5]
In [44]: np.genfromtxt(txt.splitlines(), delimiter=',',names=True, dtype=None,usecols=col)
Out[44]: 
array([(b'a1', 0, 0, 0,  0.1), (b'b2', 1, 2, 3,  4.4)], 
      dtype=[('a', 'S2'), ('b', '<i4'), ('c', '<i4'), ('d', '<i4'), ('e', '<f8')])

recfunctions

There's a library of functions that can help manipulate structured arrays:

In [45]: import numpy.lib.recfunctions as rf
In [47]: rf.drop_fields(data, ['id'])
Out[47]: 
array([(b'a1', 0, 0, 0,  0.1), (b'b2', 1, 2, 3,  4.4)], 
      dtype=[('a', 'S2'), ('b', '<i4'), ('c', '<i4'), ('d', '<i4'), ('e', '<f8')])

Most functions in this group work by constructing a 'blank' array with the target dtype, and then copying values, by field, from the source to the target.

field copy

Here's the field copy approach used in recfunctions:

In [65]: data.dtype.descr  # dtype description as list of tuples
Out[65]: 
[('a', '|S2'),
 ('id', '<i4'),
 ('b', '<i4'),
 ('c', '<i4'),
 ('d', '<i4'),
 ('e', '<f8')]
In [66]: desc=data.dtype.descr
In [67]: del desc[1]                # remove one field
In [68]: res = np.zeros(data.shape, dtype=desc)  # target
In [69]: res
Out[69]: 
array([(b'', 0, 0, 0,  0.), (b'', 0, 0, 0,  0.)], 
      dtype=[('a', 'S2'), ('b', '<i4'), ('c', '<i4'), ('d', '<i4'), ('e', '<f8')])
In [70]: for name in res.dtype.names:    # copy by field name
    ...:     res[name] = data[name]

In [71]: res
Out[71]: 
array([(b'a1', 0, 0, 0,  0.1), (b'b2', 1, 2, 3,  4.4)], 
      dtype=[('a', 'S2'), ('b', '<i4'), ('c', '<i4'), ('d', '<i4'), ('e', '<f8')])

Since usually structured arrays have many records, and few fields, copying by field name is relatively fast.

The linked SO answer cited matplotlib.mlab.rec_drop_fields(rec, names). This essentially does what I just outlined: make a target with the desired fields, and copy fields by name.

newdtype = np.dtype([(name, rec.dtype[name]) for name in rec.dtype.names
                     if name not in names])
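Filled out, that approach looks roughly like this (a sketch of the idea, not the exact mlab source):

```python
import numpy as np

def drop_fields_sketch(rec, names):
    """Sketch of the rec_drop_fields idea: build a target dtype without
    the dropped names, then copy the remaining fields one by one."""
    newdtype = np.dtype([(name, rec.dtype[name]) for name in rec.dtype.names
                         if name not in names])
    newrec = np.empty(rec.shape, dtype=newdtype)  # blank target array
    for field in newdtype.names:                  # copy by field name
        newrec[field] = rec[field]
    return newrec

data = np.array([(b'a1', 3, 0.1), (b'b2', 4, 4.4)],
                dtype=[('a', 'S2'), ('id', '<i4'), ('e', '<f8')])
print(drop_fields_sketch(data, ['id']).dtype.names)  # ('a', 'e')
```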
hpaulj
  • Under your `usecols` heading, how would you do that if you didn't know which column `id` was in? I really like your `recfunctions` information btw. Nice informative answer. – Chuck Feb 15 '17 at 08:06
  • I'd have to do an exploratory read (e.g. genfromtxt with a limited number of rows) to find the names. I may be wrong on this, but I imagine most `csv` users know what the column names are beforehand. – hpaulj Feb 15 '17 at 08:23
  • Since this was written multifield indexing has changed. Read the structured arrays docs for details. – hpaulj May 01 '21 at 00:53
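The exploratory read mentioned in the comments above might look like this; it uses genfromtxt's max_rows parameter (available in newer numpy versions) to probe only the header and a single data row before reading the full file:

```python
import numpy as np

txt = """a,id,b,c
a1,3,0,0
b2,4,1,2
"""

# probe: read the header plus one data row to discover the column names
probe = np.genfromtxt(txt.splitlines(), delimiter=',', names=True,
                      dtype=None, encoding='utf-8', max_rows=1)
usecols = [i for i, n in enumerate(probe.dtype.names) if n != 'id']
print(usecols)  # [0, 2, 3]

# full read, skipping the 'id' column
data = np.genfromtxt(txt.splitlines(), delimiter=',', names=True,
                     dtype=None, encoding='utf-8', usecols=usecols)
print(data.dtype.names)  # ('a', 'b', 'c')
```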

I know you already have a comprehensive answer, but here is another approach that I just put together.

import numpy as np

For some sample data file:

test1.csv:
a   b   c   id
0   1   2   3
4   5   6   7
8   9   10  11

Import using genfromtxt:

d = np.genfromtxt('test1.csv', delimiter="\t", names=True)

d
> array([(0.0, 1.0, 2.0, 3.0), (4.0, 5.0, 6.0, 7.0), (8.0, 9.0, 10.0, 11.0)], 
  dtype=[('a', '<f8'), ('b', '<f8'), ('c', '<f8'), ('id', '<f8')])

Return a single column from your array by doing:

d['a']
> array([ 0.,  4.,  8.])

To delete the column by the name 'id' you can do the following:

Return a list of the column names by writing:

list(d.dtype.names)
> ['a', 'b', 'c', 'id']

Create a new numpy array by returning only those columns not equal to the string id.

Use a list comprehension to return a new list without your 'id' string:

[b for b in list(d.dtype.names) if b != 'id']
> ['a', 'b', 'c']

Combine to give:

d_new = d[[b for b in list(d.dtype.names) if b != 'id']]

> array([(0.0, 1.0, 2.0), (4.0, 5.0, 6.0), (8.0, 9.0, 10.0)], 
  dtype=[('a', '<f8'), ('b', '<f8'), ('c', '<f8')])

This returns the array:

a   b   c
0   1   2
4   5   6
8   9   10
Chuck

This may be newer functionality in numpy (it works in 1.20.2), but you can slice a structured array using a list of names (a tuple of names doesn't work, though).

data = np.genfromtxt('some_file.csv', names=['a', 'b', 'c', 'd', 'e'])
# I don't want columns b or d
sliced = data[['a', 'c', 'e']]

I notice that you may need to eliminate many columns that are named id. These columns show up as ['id', 'id_1', 'id_2', ...] when parsed by genfromtxt, so you can use a list comprehension to pick out those column names and make a slice out of them.

no_ids = data[[n for n in data.dtype.names if 'id' not in n]]
medley56