1

In python pandas I have a dataframe

df_aaa:

date     data  otherdata symbol
2015/1/1 11    12         aaa
2015/2/1 21    22         aaa
2015/3/1 31    31         aaa

df_all:

2015/1/1 31    31         bbb

Currently the index of both is by date .

I want to append df_aaa to df_all, and have them with a composite index of both symbol and date.

  1. How do I do that?

Basically the following are all one question: How do I set a multi-index and use it when appending. Can I do it with different column order? Do I need to refresh? Etc.:

  • I'm not sure if a multi-index is an index that has multiple 'columns' (or rows), or is it the ability to have more than one index (and any of them could be for multiple columns or rows). Or are both correct?

  • Must I first set the index of both dataframes to a multi-index, so the append will work? (otherwise I'll have duplicates for different symbols

  • Do I have to "drop" the existing index before creating the new one?

  • Is there such a thing as a dataframe with data but no index?

  • Must a (single) index be of unique values?

  • When do I use which of the following dataframe methods: set_index(), reindex(), reset_index(), set_level, reset_level?

  • And what is the default when I give these methods an array. Python docs are daunting, and I can't find my hands or legs in them. Giving some good examples would help...
  • Do I have to add anything (like axis=1) when setting the index?

  • How do I set the index to be the data in a column. (And why does sometimes using ['symbol', 'date'] as a parameter, give me a new column with those two values, instead of setting the index on the existing values of the columns with those two names?)

  • After I append and assuming the old index is correct do I need to 'update' the index (perhaps using reindex?) or since I told the dataframe that the index is in a certain column, is my data correctly indexed?

  • And since my dataframes (will) have indices on the same column name, can I do an append of df_aaa on df_all even if df_all was defined to have the columns originally in a different order. (say: ['symbol', 'date', 'data', 'otherdata'] with symbol the first column)?

pashute
  • 3,965
  • 3
  • 38
  • 65

3 Answers3

4

You can just concatenate them and then set the index.

df_aaa = df_aaa.reset_index()
df_all = df_all.reset_index()

df = df_aaa.append(df_all).set_index(['symbol', 'date'])

Note that this would work only if your dataframes have the same column.s

If you must perform multiple appends in the future, the best thing to do would be to get one of them in the shape of the other, perform the concatenation, and reset index as needed.


I'll answer all your questions one by one.

I'm not sure if a multi-index is an index that has multiple 'columns' (or rows), or is it the ability to have more than one index (and any of them could be for multiple columns or rows). Or are both correct?

It depends on what axis you're referring to. Along the row (0th axis), you have 2 or more columns forming a MultiIndex. Similarly for along the columns (1st axis).

Must I first set the index of both dataframes to a multi-index, so the append will work? (otherwise I'll have duplicates for different symbols

No need. Although you could, not doing so would be simpler in this case.

Do I have to "drop" the existing index before creating the new one? No, just that the columns must align (column name and number of columns should be the same).

Is there such a thing as a dataframe with data but no index?

No. All rows are indexed. Even if there is no column as the index, the index is a monotonically increasing number. The model followed here is similar to that in RDBMs.

Must a (single) index be of unique values?

In general, the must, so rows can be uniquely identified. If you have a MultiIndex, each combination of values that make up the index must be unique.

When do I use which of the following dataframe methods: set_index(), reindex(), reset_index(), set_level, reset_level?

This is a broad question. It depends, when do you want to operate on the index and if so, what do you want to do with it? Look at the documentation for each one carefully.

cs95
  • 379,657
  • 97
  • 704
  • 746
  • Note that in the question both dataframes CURRENTLY HAVE A SET INDEX of date. and not the auto number. About the various dataframe functions: I meant in the context of this question: In all of them I have "level=" and "columns=" and I don't understand what the default of ['symbol', 'date'] does. I did the set_index after the merge and it DID NOT set the column correctly. – pashute Nov 08 '17 at 16:12
  • @pashute What is the current index of both your dataframes? I've tried my best to help you despite the limited context your question gives. – cs95 Nov 08 '17 at 16:13
  • @coldspeed thank you till now! But a. Your solution doesn't work, if I have the date index. It gave me 2 rows with na's. the index column had no name and on row 1 had 'symbol' and row 2 'date'. – pashute Nov 08 '17 at 16:21
  • @pashute Call `reset_index` first, and then concatenate. See edit. – cs95 Nov 08 '17 at 16:23
3

Just append df's and reset_index() to be able to set_index() with keys argument. Here's oneliner:

df_all = df_all.append(df_aaa).reset_index().set_index(keys=['symbol', 'date'])

And here is full working sample.

In [1]: import pandas as pd
   ...: from io import StringIO
   ...: 

In [2]: df_aaa = pd.read_csv(StringIO("""date     data  otherdata symbol
   ...: 2015/1/1 11    12         aaa
   ...: 2015/2/1 21    22         aaa
   ...: 2015/3/1 31    31         aaa
   ...: """), sep="\s+", index_col='date')
   ...: 

In [3]: df_all = pd.read_csv(StringIO("""date     data  otherdata symbol
   ...: 2015/1/1 31    31         bbb"""), sep="\s+", index_col='date')
   ...: 

In [4]: df_all.append(df_aaa).reset_index().set_index(keys=['symbol', 'date'])
Out[4]: 
                 data  otherdata
symbol date                     
bbb    2015/1/1    31         31
aaa    2015/1/1    11         12
       2015/2/1    21         22
       2015/3/1    31         31
tworec
  • 4,409
  • 2
  • 29
  • 34
  • In what way, shape, or form is your answer different from mine, posted almost 40 minutes later? – cs95 Nov 08 '17 at 16:49
  • i'm providing full working sample. you were faster though :) – tworec Nov 08 '17 at 16:50
  • I don't see how that makes a difference. Besides, OP conceded that it doesn't work as it is (a small change is needed, see my answer) – cs95 Nov 08 '17 at 16:51
  • oh, you're right. I didnt saw this conversation till now – tworec Nov 08 '17 at 16:56
  • @cᴏʟᴅsᴘᴇᴇᴅ kudos+1 for you – tworec Nov 08 '17 at 17:01
  • Thanks, returned the gesture. – cs95 Nov 08 '17 at 17:01
  • 1
    Thank you tworec! Without your answer I wouldn't understand Coldspeed's which still states: Do I have to "drop" the existing index before creating the new one? No, just that the columns must align (column name and number of columns should be the same). The reset_index of course does exactly that. Neither of you replied to the question of re-indexing, nor to the question of index vs. column. I'll accept ColdSpeed's answer but put another answer of my own with the rest as far as I understand, for us programmers. – pashute Nov 08 '17 at 23:36
1

Here is what I gather from the answers and dragging through the docs:

There is a "default index" which is a "row-number" for each row, and which is not part of any of the columns.

When merging with that index, there (seems to be) no need to re-index.

But if I want to change the index after it was made "non-standard" I have to "reset_index()" and turn it back to the default, and then from there I can create the new multi index (as explained in the revisioned answer below)

A multi-index is one that has more than one key (i.e. if indexing the rows, then more than one column will be used).

I'm still not sure if you have to re-index a column after a merge, but according to this it seems you get an automatically generated new "default index" and have to save the old one, remove the index before merge (reset_index) and set it again when done.

The other question about the index replacing a column - I'll check and get back here.

This is a follow-up.

pashute
  • 3,965
  • 3
  • 38
  • 65