
I have a DataFrame pd1 that I read with pandas:

pd1 = pd.read_csv(r'c:\am\wiki_stats\topandas.txt',sep=':',
                  header=None, names  = ['date-time','domain','requests-qty','response-bytes'],
                   parse_dates=[1], converters={'date-time': to_datetime}, index_col = 'date-time')

with index

>> pd1.index:  

 DatetimeIndex(['2016-01-01 00:00:00', '2016-01-01 00:00:00',
                '2016-01-01 00:00:00', '2016-01-01 00:00:00',
                '2016-01-01 00:00:00', '2016-01-01 00:00:00',
                '2016-01-01 00:00:00', '2016-01-01 00:00:00',
                '2016-01-01 00:00:00', '2016-01-01 00:00:00',
                ...
                '2016-08-05 12:00:00', '2016-08-05 12:00:00',
                '2016-08-05 12:00:00', '2016-08-05 12:00:00',
                '2016-08-05 12:00:00', '2016-08-05 12:00:00',
                '2016-08-05 12:00:00', '2016-08-05 12:00:00',
                '2016-08-05 12:00:00', '2016-08-05 12:00:00'],
               dtype='datetime64[ns]', name='date-time', length=6084158, freq=None)

But when I try to set the index to that column, I get the error below. I initially wanted a multi-column index, and that is when the error first appeared. I then created another dataframe with other columns as the index, pd_new_index = pd1.set_index(['requests-qty','domain']), which worked, and tried to build a new frame from it with the index set back to the 'date-time' column, pd_new_2 = pd_new_index.set_index(['date-time']), which gave the same error. 'date-time' does not look like a special keyword, and that column is the index now. Why the error?

KeyError                                  Traceback (most recent call last)
C:\ProgramData\Anaconda3\lib\site-packages\pandas\core\indexes\base.py in get_loc(self, key, method, tolerance)
   2656             try:
-> 2657                 return self._engine.get_loc(key)
   2658             except KeyError:

pandas/_libs/index.pyx in pandas._libs.index.IndexEngine.get_loc()

pandas/_libs/index.pyx in pandas._libs.index.IndexEngine.get_loc()

pandas/_libs/hashtable_class_helper.pxi in pandas._libs.hashtable.PyObjectHashTable.get_item()

pandas/_libs/hashtable_class_helper.pxi in pandas._libs.hashtable.PyObjectHashTable.get_item()

KeyError: 'date-time'

During handling of the above exception, another exception occurred:

KeyError                                  Traceback (most recent call last)
----> 1 pd_new_2 = pd_new_index.set_index(['date-time'])

C:\ProgramData\Anaconda3\lib\site-packages\pandas\core\frame.py in set_index(self, keys, drop, append, inplace, verify_integrity)
   4176                 names.append(None)
   4177             else:
-> 4178                 level = frame[col]._values
   4179                 names.append(col)
   4180                 if drop:

C:\ProgramData\Anaconda3\lib\site-packages\pandas\core\frame.py in __getitem__(self, key)
   2925             if self.columns.nlevels > 1:
   2926                 return self._getitem_multilevel(key)
-> 2927             indexer = self.columns.get_loc(key)
   2928             if is_integer(indexer):
   2929                 indexer = [indexer]

C:\ProgramData\Anaconda3\lib\site-packages\pandas\core\indexes\base.py in get_loc(self, key, method, tolerance)
   2657                 return self._engine.get_loc(key)
   2658             except KeyError:
-> 2659                 return self._engine.get_loc(self._maybe_cast_indexer(key))
   2660         indexer = self.get_indexer([key], method=method, tolerance=tolerance)
   2661         if indexer.ndim > 1 or indexer.size > 1:

pandas/_libs/index.pyx in pandas._libs.index.IndexEngine.get_loc()

pandas/_libs/index.pyx in pandas._libs.index.IndexEngine.get_loc()

pandas/_libs/hashtable_class_helper.pxi in pandas._libs.hashtable.PyObjectHashTable.get_item()

pandas/_libs/hashtable_class_helper.pxi in pandas._libs.hashtable.PyObjectHashTable.get_item()

KeyError: 'date-time'

Mohit Motwani
Alex Martian

1 Answer

The reason is that date-time is already the index (here a DatetimeIndex), so it cannot be selected by name like a column.

It became the index because of the index_col parameter:

pd1 = pd.read_csv(r'c:\am\wiki_stats\topandas.txt',
                  sep=':',
                  header=None, 
                  names  = ['date-time','domain','requests-qty','response-bytes'],
                  parse_dates=[1], 
                  converters={'date-time': to_datetime}, 
                  index_col = 'date-time')
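
A minimal sketch of the consequence, using a small hypothetical frame in place of the real file (the three rows are made up for illustration): once a column has become the index it is no longer listed in `df.columns`, so `set_index` on its name raises `KeyError`. Calling `reset_index()` first moves it back to an ordinary column.

```python
import pandas as pd

# Hypothetical three-row stand-in for the real data.
df = pd.DataFrame(
    {
        "date-time": pd.to_datetime(["2016-01-01", "2016-01-02", "2016-01-03"]),
        "domain": ["d1", "d2", "d3"],
        "requests-qty": [0, 0, 1],
        "response-bytes": [0, 1, 0],
    }
).set_index("date-time")

# The index name is no longer a column label.
in_columns = "date-time" in df.columns
index_name = df.index.name

# Selecting it as a column therefore fails with KeyError.
try:
    df.set_index(["date-time"])
    raised = False
except KeyError:
    raised = True

# reset_index() turns the index back into an ordinary column,
# after which any combination of columns can become the index.
restored = df.reset_index().set_index(["date-time", "domain"])

print(in_columns, index_name, raised)
print(restored.index.names)
```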

For a MultiIndex, pass a list of column names to index_col, remove converters, and specify the column name in the parse_dates parameter:

import pandas as pd
from io import StringIO

temp=u"""2016-01-01:d1:0:0
2016-01-02:d2:0:1
2016-01-03:d3:1:0"""
#after testing, replace 'StringIO(temp)' with r'c:\am\wiki_stats\topandas.txt'
df = pd.read_csv(StringIO(temp), 
                 sep=':',
                 header=None, 
                 names  = ['date-time','domain','requests-qty','response-bytes'],
                 parse_dates=['date-time'], 
                 index_col = ['date-time','domain'])

print (df)

                   requests-qty  response-bytes
date-time  domain                              
2016-01-01 d1                 0               0
2016-01-02 d2                 0               1
2016-01-03 d3                 1               0

print (df.index)
MultiIndex([('2016-01-01', 'd1'),
            ('2016-01-02', 'd2'),
            ('2016-01-03', 'd3')],
           names=['date-time', 'domain'])

EDIT1: Solution with append parameter in set_index:

import pandas as pd
from io import StringIO


temp=u"""2016-01-01:d1:0:0
2016-01-02:d2:0:1
2016-01-03:d3:1:0"""
#after testing, replace 'StringIO(temp)' with r'c:\am\wiki_stats\topandas.txt'
df = pd.read_csv(StringIO(temp), 
                 sep=':',
                 header=None, 
                 names  = ['date-time','domain','requests-qty','response-bytes'],
                 parse_dates=['date-time'], 
                 index_col = 'date-time')

print (df)
           domain  requests-qty  response-bytes
date-time                                      
2016-01-01     d1             0               0
2016-01-02     d2             0               1
2016-01-03     d3             1               0

print (df.index)
DatetimeIndex(['2016-01-01', '2016-01-02', '2016-01-03'], 
              dtype='datetime64[ns]', name='date-time', freq=None)

df1 = df.set_index(['domain'], append = True)
print (df1)
                   requests-qty  response-bytes
date-time  domain                              
2016-01-01 d1                 0               0
2016-01-02 d2                 0               1
2016-01-03 d3                 1               0

print (df1.index)
MultiIndex([('2016-01-01', 'd1'),
            ('2016-01-02', 'd2'),
            ('2016-01-03', 'd3')],
           names=['date-time', 'domain'])
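
This also seems to explain the second attempt in the question. The sketch below, run on the same three sample rows, shows that `set_index` without `append` replaces the existing index, so 'date-time' is discarded and cannot be set back afterwards. The variable names `replaced`, `appended` and `back` are only illustrative.

```python
import pandas as pd
from io import StringIO

temp = """2016-01-01:d1:0:0
2016-01-02:d2:0:1
2016-01-03:d3:1:0"""

df = pd.read_csv(StringIO(temp),
                 sep=':',
                 header=None,
                 names=['date-time', 'domain', 'requests-qty', 'response-bytes'],
                 parse_dates=['date-time'],
                 index_col='date-time')

# Without append, the new index replaces the old one: 'date-time' is gone.
replaced = df.set_index(['requests-qty', 'domain'])
lost = ('date-time' not in replaced.columns
        and 'date-time' not in replaced.index.names)

# With append=True the existing level is kept and the new ones follow it.
appended = df.set_index(['requests-qty', 'domain'], append=True)

# A single level can be moved back to a column by name.
back = appended.reset_index(level='requests-qty')

print(lost)
print(appended.index.names)
print(back.index.names)
```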
jezrael
  • How do I add other column to index to make index like `pd1.set_index(['date-time','domain'])`? – Alex Martian Sep 03 '19 at 11:42
  • I understood I can append, can't i? `pd_new_index4 = pd1.set_index(['domain'], append = True)` when after that command I run `pd_new_index_v4.head(5)` it shows two first column names below others - like only first before. But `print (pd_new_index_v4.index)` gives nothing and after some other clicks I have `insufficient memory to display page` something error in jupyter. That is another issue I suppose. But append should work? – Alex Martian Sep 03 '19 at 12:02
  • @AlexeiMartianov - I think `pd_new_index4 = pd1.set_index(['domain'], append = True)` is a good solution. What does `print (pd_new_index_v4.index)` return? Nothing? That is weird – jezrael Sep 03 '19 at 12:03
  • I guess it's low memory issue, my dataset could be considered large (200Mb text file). Or it is not that large? How do I know maybe Jupyter is just lagging? – Alex Martian Sep 03 '19 at 12:05
  • @AlexeiMartianov - hmm, it is possible, because it works nicely for me (EDIT1). Does my second solution with a list for `index_col` not work? – jezrael Sep 03 '19 at 12:07
  • both prints result in insufficient memory. What index does except make output of index? Is it that command: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Index.html? – Alex Martian Sep 04 '19 at 06:54
  • @AlexeiMartianov - More info about index - [here](https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html) and about [`MultiIndex`](https://pandas.pydata.org/pandas-docs/stable/user_guide/advanced.html) – jezrael Sep 04 '19 at 06:55
  • thank you! also I could not quickly web-search u""" and usage of it in stringio (https://docs.python.org/2/library/stringio.html - very short and strangely only version 2 of python, u is unicode, but """ i could not find, why result of stringIO is that particular 3 lines also as description on link is short). Could you please point me also? – Alex Martian Sep 04 '19 at 09:19
  • @AlexeiMartianov - sure, it is called a [multi line string](https://stackoverflow.com/questions/10660435/pythonic-way-to-create-a-long-multi-line-string), and `u` is the unicode prefix used in Python 2. It can be removed now, because Python 3 strings are unicode by default – jezrael Sep 04 '19 at 09:21
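
On the memory question raised in the comments, a possible way to inspect a large index without rendering all of it is sketched below. The frame is a generated, hypothetical stand-in for the 6-million-row data, and whether this avoids the Jupyter error described above is an assumption, not something confirmed in the thread.

```python
import pandas as pd

# Hypothetical stand-in for the large frame (sizes and values are made up).
n = 100_000
df = pd.DataFrame(
    {
        "domain": ["d%d" % (i % 100) for i in range(n)],
        "requests-qty": range(n),
    },
    index=pd.date_range("2016-01-01", periods=n, freq="min", name="date-time"),
)
df1 = df.set_index(["domain"], append=True)

# Cheap summaries instead of printing the whole index.
summary = {
    "levels": df1.index.nlevels,
    "names": list(df1.index.names),
    "rows": len(df1.index),
    "bytes": int(df1.memory_usage(deep=True).sum()),
}

# Slicing the index shows only the first few entries.
first_rows = df1.index[:5]

print(summary)
print(first_rows)
```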