7

I have hundreds of text files like the one below, with each column separated by three spaces. Each file holds a year of data: 12 month columns, with up to 31 day rows per month.

Below, I'm only showing what's relevant to the question:

001 DIST - ADILABAD ANDHRA MEAN TEMP

 DATE  JAN    FEB    MAR . . . .  NOV    DEC  
 01    21.5   24.3   27.1         25.8   22.4  
 02    21.4   24.2   27.1         25.8   22.4  
 .        .      .      .            .      .
 .        .      .      .            .      .
 .        .      .      .            .      . 
 27    23.6   26.8   30.3         23.1   21.3  
 28    23.8   27.0   30.6         22.9   21.3  
 29    23.4          31.0         22.9   21.2  
 30    23.5          31.1         22.6   21.4  
 31    23.8          31.2 . . . .        21.6  

I want to read each column into an array and then average it.

For this I'm using the genfromtxt() function like this:

import numpy as np
JAN,FEB,MAR,APR,MAY,JUN,JUL,AUG,SEP,OCT,NOV,DEC = np.genfromtxt("tempmean_andhra_adilabad.txt", skiprows=3, 
                                                                 unpack=True, invalid_raise=False, 
                                                                 usecols=(1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12), 
                                                                 autostrip=True)

As you can see, I've skipped the first three rows and the first column, and unpacked each remaining column into its own array. Without invalid_raise=False, I was getting the following error:

Traceback (most recent call last):
  File "<pyshell#32>", line 1, in <module>
    JAN,FEB,MAR,APR,MAY,JUN,JUL,AUG,SEP,OCT,NOV,DEC = np.genfromtxt("temp mean_andhra_adilabad.txt",skiprows=3,unpack=True,usecols=(1,2,3,4,5,6,7,8,9,10,11,12),autostrip=True)
  File "C:\Python27\lib\site-packages\numpy\lib\npyio.py", line 1667, in genfromtxt
    raise ValueError(errmsg)
ValueError: Some errors were detected !
    Line #32 (got 12 columns instead of 12)
    Line #33 (got 12 columns instead of 12)
    Line #34 (got 8 columns instead of 12)

I think this problem is because the columns have different lengths? Or is there some other reason?

I wanted to see the output, so I used invalid_raise=False. Now my problem is that when I print any of the arrays, such as JAN, I only get 28 elements; every array has only 28 elements. It seems that only 28 rows are read for each column because the FEB column ends at 28 days. But I need all the data for each month, i.e. 31 elements for JAN, 30 for JUN, etc.

How do I get all elements for each month?

I think it's a very basic question, but I'm very new to Python and NumPy and began learning just two weeks ago. I've searched a lot of questions on StackOverflow and Google and learned how to skip rows, columns, etc., but I could not find an answer to this particular question.

Please suggest some module, function, code etc.

Thanks in advance.

user3707588
  • What happens if you pass `filling_values=NaN` as a param? – EdChum Jun 06 '14 at 10:37
  • On your suggestion I passed `filling_values=NaN` and got the following error: `NameError: name 'NaN' is not defined` – user3707588 Jun 06 '14 at 10:44
  • Try `numpy.NaN` or even `0`, interested to know if this works – EdChum Jun 06 '14 at 10:51
  • It didn't show any error, but the arrays still have 28 elements, and no NaN values were filled in. I had read about the `filling_values` attribute, but I didn't use it, thinking I wouldn't be able to use `np.mean()` later to find the average. – user3707588 Jun 06 '14 at 10:56
  • Could you post the data or a link? It's probable that Pandas can handle this, and pandas uses numpy to store the data. – EdChum Jun 06 '14 at 11:00
  • I'm sorry, but I don't know how to post the text file here. I've read a bit about the Pandas package but I don't know how to use it. I'm unable to post it here because only a limited number of characters are allowed in a comment. – user3707588 Jun 06 '14 at 11:03
  • I mean like a link to dropbox or similar. Pandas supports importing fixed width files – EdChum Jun 06 '14 at 11:05
  • Oh, here's the [Google Drive link](https://drive.google.com/file/d/0B2QJp9Vh69bsQXEtaUNoVWMtWmM/edit?usp=sharing) – user3707588 Jun 06 '14 at 11:10

2 Answers

5

Your data is not delimited by a separator character; instead, it has fixed-width columns. As @EdChum shows in his answer, pandas has a function for reading data with fixed-width columns. You can also use genfromtxt by giving the column widths in the delimiter argument. It looks like the field widths are (4, 7, 7, 7, ...). In the code below, I'll write this as (4,) + (7,)*12:

In [27]: (4,) + (7,)*12
Out[27]: (4, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7)

The default data type used by genfromtxt is np.float64. If a field can't be converted to a float, it will be replaced with nan. So the data at the end of the months with fewer than 31 days will be nan.
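
For example, here is a tiny self-contained sketch (two made-up lines, not your actual file) of how a blank fixed-width field comes back as nan:

import numpy as np
from io import BytesIO

# Two fixed-width lines: a 4-character day field, a 7-character value,
# and a blank 7-character field (like the end of a short month).
sample = b" 29    23.4       \n 30    23.5       \n"

# delimiter=(4, 7, 7) cuts each line at those widths; the blank third
# field can't be converted to a float, so it is replaced with nan.
np.genfromtxt(BytesIO(sample), delimiter=(4, 7, 7), usecols=(1, 2))
# array([[ 23.4,   nan],
#        [ 23.5,   nan]])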

In the following, I renamed your file to "temp_mean.txt". Note that your file has an extra blank line at the end, so the argument skip_footer=1 is also used. If you don't use this argument, you'll get an extra row of nan values in data.

In [16]: data = genfromtxt("temp_mean.txt", skiprows=3, delimiter=(4,)+(7,)*12, usecols=range(1,13), skip_footer=1)

In [17]: data.shape
Out[17]: (31, 12)

In [18]: data[:,0]  # JAN
Out[18]: 
array([ 21.5,  21.4,  21.2,  21.2,  21.4,  21.7,  21.8,  22. ,  22. ,
        22.3,  22.3,  22.3,  22.5,  22.5,  22.5,  22.5,  22.5,  22.6,
        22.8,  23.1,  23.1,  22.8,  22.9,  23.1,  23.4,  23.5,  23.6,
        23.8,  23.4,  23.5,  23.8])

In [19]: data[:,1]  # FEB
Out[19]: 
array([ 24.3,  24.2,  24.3,  24.4,  24.6,  24.4,  24.1,  24.4,  24.5,
        24.6,  24.9,  25. ,  25.1,  25.6,  25.7,  25.7,  25.8,  26. ,
        25.9,  25.9,  25.8,  25.8,  25.8,  26.2,  26.5,  26.7,  26.8,
        27. ,   nan,   nan,   nan])

In [20]: data[-1,:]  # Last row.
Out[20]: 
array([ 23.8,   nan,  31.2,   nan,  34.7,   nan,  27.4,  27. ,   nan,
        25.7,   nan,  21.6])

To get the monthly means, you can use np.nanmean:

In [21]: np.nanmean(data, axis=0)
Out[21]: 
array([ 22.5483871 ,  25.35714286,  29.22903226,  32.79333333,
        34.65806452,  31.19666667,  27.89032258,  27.01612903,
        27.66666667,  27.22580645,  24.34666667,  21.81290323])
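
Since you mention having hundreds of files with the same layout, one possible way to reuse this read-and-average step across all of them and collect the results into a CSV (one row per district) is sketched below. The filename pattern, the output file name, and the way the district name is pulled from the header line are assumptions based on the sample shown, so adjust them to your files:

import glob
import numpy as np

def monthly_means(filename):
    """Return the 12 monthly means of one fixed-width temperature file."""
    data = np.genfromtxt(filename, skip_header=3,   # skip_header is the current name for skiprows
                         delimiter=(4,) + (7,)*12,
                         usecols=range(1, 13), skip_footer=1)
    return np.nanmean(data, axis=0)

months = "JAN FEB MAR APR MAY JUN JUL AUG SEP OCT NOV DEC".split()
with open("monthly_means.csv", "w") as out:            # hypothetical output file
    out.write("district_name," + ",".join(months) + "\n")
    for path in sorted(glob.glob("tempmean_*.txt")):   # hypothetical filename pattern
        with open(path) as f:
            # Header like "001 DIST - ADILABAD ANDHRA MEAN TEMP" -> "ADILABAD"
            district = f.readline().split("-")[1].split()[0]
        means = monthly_means(path)
        out.write(district + "," + ",".join("%.2f" % m for m in means) + "\n")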
Warren Weckesser
  • Thanks, it's working. Can you please explain what `delimiter` did? I get how the widths are (4, 7, 7, ...), but the value of `delimiter` [(4,)+(7,)*12], is it a regex? And how did the NaN get appended at the end when you're not using `filling_values`? – user3707588 Jun 06 '14 at 12:08
  • I updated my answer. `(4,) + (7,)*12` is "tuple" arithmetic: `+` does concatenation, and `*` with a tuple and an integer does repetition. That expression is equivalent to `(4, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7)`. – Warren Weckesser Jun 06 '14 at 12:41
  • Thank you sir, excellent explanation; it solved the problem. Can I ask you one thing: as I mentioned, I have hundreds of such files and I want to get the averages into a CSV file in this form: `district_name,JanAvg,FebAvg...DecAvg`. So how can I read so many files? I am new to Python and NumPy, and any resource/tutorial etc. you can point to would be greatly helpful. I didn't select your answer right away because I thought that would close the thread. It's also fine if you don't know; I just want your suggestion. – user3707588 Jun 06 '14 at 15:30
  • It would probably be better to ask a new question about how to approach that problem. – Warren Weckesser Jun 06 '14 at 16:37
  • OK, I think there's a limit to how many questions a newbie can ask on StackOverflow, so I'll search on my own. Anyway, thank you very much. I was stuck on this since Wednesday. Selected as answer. :) – user3707588 Jun 06 '14 at 16:46
4

UPDATE

Thanks to Warren Weckesser for pointing out that you can pass the column widths, which handles this file properly.

OK, pandas reads fixed-width files fine:

In [192]:

df = pd.read_fwf(r'c:\data\temp mean_andhra_adilabad.txt',skiprows=2, widths=(5,)+(7,)*12, skip_footer=1)
df
Out[192]:
    DATE   JAN   FEB   MAR   APR   MAY   JUN   JUL   AUG   SEP   OCT   NOV  \
0      1  21.5  24.3  27.1  31.3  34.1  34.5  29.0  27.5  27.1  28.0  25.8   
1      2  21.4  24.2  27.1  31.4  33.8  34.1  28.8  27.5  27.1  28.0  25.8   
2      3  21.2  24.3  27.1  31.5  34.4  34.1  28.6  27.5  27.0  28.0  25.6   
3      4  21.2  24.4  27.1  31.7  34.4  33.8  28.5  27.1  27.0  27.9  25.5   
4      5  21.4  24.6  27.6  31.7  34.4  33.5  28.2  27.0  27.1  27.8  25.4   
5      6  21.7  24.4  28.0  31.6  34.5  33.3  28.2  27.1  27.0  28.0  25.1   
6      7  21.8  24.1  28.1  31.5  34.5  32.9  28.2  27.1  27.0  27.8  25.3   
7      8  22.0  24.4  28.3  31.8  34.6  33.3  27.9  26.7  27.1  27.9  25.1   
8      9  22.0  24.5  28.3  32.2  34.6  33.1  27.8  26.6  27.2  28.1  24.8   
9     10  22.3  24.6  28.4  32.1  34.5  32.5  28.0  26.7  27.2  27.9  25.0   
10    11  22.3  24.9  28.6  32.3  34.4  32.2  27.8  26.9  27.2  28.0  25.2   
11    12  22.3  25.0  28.3  32.6  34.4  32.0  27.6  27.1  27.3  27.9  24.9   
12    13  22.5  25.1  28.6  32.7  34.5  31.4  27.8  27.1  27.5  27.8  24.8   
13    14  22.5  25.6  28.7  33.1  34.7  31.2  27.7  26.8  27.6  27.7  24.6   
14    15  22.5  25.7  29.1  33.2  34.6  31.0  27.8  27.0  27.9  27.6  24.6   
15    16  22.5  25.7  29.4  33.1  34.4  30.6  27.7  26.9  28.0  27.6  24.5   
16    17  22.5  25.8  29.5  32.8  34.6  30.1  27.8  26.8  28.1  27.2  24.3   
17    18  22.6  26.0  29.9  33.0  34.8  30.1  27.6  27.0  28.2  27.3  24.0   
18    19  22.8  25.9  30.2  33.3  34.7  30.0  27.9  27.0  28.1  27.2  24.0   
19    20  23.1  25.9  30.2  33.3  35.1  30.2  27.9  27.0  27.9  27.2  24.0   
20    21  23.1  25.8  30.2  33.5  34.9  30.1  27.8  26.9  28.0  26.9  23.8   
21    22  22.8  25.8  30.6  33.4  35.1  29.8  27.8  26.8  28.2  26.7  23.5   
22    23  22.9  25.8  30.6  33.4  35.1  29.6  27.8  26.8  28.2  26.7  23.5   
23    24  23.1  26.2  30.4  33.5  35.1  29.3  27.8  27.0  28.1  26.5  23.5   
24    25  23.4  26.5  30.2  33.5  35.1  29.2  27.6  27.3  28.1  26.5  23.3   
25    26  23.5  26.7  30.3  33.6  35.0  29.1  27.6  27.4  28.2  26.4  23.0   
26    27  23.6  26.8  30.3  33.8  35.1  28.8  27.6  27.1  28.2  26.2  23.1   
27    28  23.8  27.0  30.6  34.1  34.9  28.5  27.6  26.8  28.2  26.0  22.9   
28    29  23.4   NaN  31.0  34.3  34.8  28.5  27.4  27.0  28.1  25.8  22.9   
29    30  23.5   NaN  31.1  34.5  34.6  29.1  27.4  27.0  28.1  25.7  22.6   
30    31  23.8   NaN  31.2   NaN  34.7   NaN  27.4  27.0   NaN  25.7   NaN   

     DEC  
0   22.4  
1   22.4  
2   22.5  
3   22.5  
4   22.6  
5   22.3  
6   22.0  
7   22.0  
8   21.8  
9   21.7  
10  21.9  
11  21.9  
12  21.8  
13  21.5  
14  21.5  
15  21.5  
16  21.7  
17  21.6  
18  21.7  
19  21.7  
20  21.8  
21  21.7  
22  21.8  
23  21.8  
24  21.7  
25  21.6  
26  21.3  
27  21.3  
28  21.2  
29  21.4  
30  21.6  

In [193]:
df.mean(axis=0)
Out[193]:
DATE    16.000000
JAN     22.548387
FEB     25.357143
MAR     29.229032
APR     32.793333
MAY     34.658065
JUN     31.196667
JUL     27.890323
AUG     27.016129
SEP     27.666667
OCT     27.225806
NOV     24.346667
DEC     21.812903
dtype: float64
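
Note that the DATE column gets averaged too (the 16.0 above); if you only want the twelve monthly means, you can drop it first, e.g.:

df.drop('DATE', axis=1).mean(axis=0)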
EdChum
  • What happened to the mean for January? – Warren Weckesser Jun 06 '14 at 11:40
  • @WarrenWeckesser good question it dropped off for some reason – EdChum Jun 06 '14 at 12:08
  • I just tried this. `read_fwf` is merging the first two columns in the file into a single column in the DataFrame with the heading `DATE JAN`. The values in this column are strings. E.g. `df['DATE JAN'][0]` is `'01 21.5'`. – Warren Weckesser Jun 06 '14 at 12:46
  • @WarrenWeckesser please see the updated answer; I had to remove a leading space from the header line, which fixed the incorrect alignment – EdChum Jun 06 '14 at 13:35
  • Giving the widths explicitly worked for me: `df = pd.read_fwf("temp_mean.txt", skiprows=2, widths=(5,)+(7,)*12)`. Then there is no need to edit the file. – Warren Weckesser Jun 06 '14 at 13:44
  • Thank you, sir, for sticking with me from the beginning. The only reason I didn't select this as the final answer is that I don't know Pandas, and it would be a new task to learn in my already tight schedule. It's definitely worth looking into, and I will when I'm done with my project. Have a nice day :) – user3707588 Jun 06 '14 at 16:48