
I have three very long lists of Pandas data frames. For example:

list_a = [tablea1, tablea2, tablea3, tablea4]

list_b = [tableb1, tableb2, tableb3, tableb4]

list_c = [tablec1, tablec2, tablec3, tablec4]

I want to do something like this:

tablea1 = pd.concat([tablea1, tableb1, tablec1], axis=1)

Naively, I wrote this code:

for i in range(len(list_a)):
    list_a[i] = pd.concat([list_a[i], list_b[i], list_c[i]], axis=1)

This code does not do what I want. Initially, list_a[0] is a reference to tablea1. Inside the loop, list_a[0] is re-assigned to point to

pd.concat([tablea1, tableb1, tablec1], axis=1)

which is a new object. In the end, tablea1 is not modified. (list_a does contain the desired result, but I want to modify tablea1 itself.) I have spent hours on this and cannot find a solution. Any help? Thanks.
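The rebinding described above can be reproduced in a minimal sketch. Plain Python lists stand in for the DataFrames and `+` stands in for pd.concat (both are assumptions made only to keep the example dependency-free); the name-binding behaviour is the same for any Python object.

```python
# Plain lists stand in for DataFrames (an assumption for brevity).
tablea1 = [1, 2]
tableb1 = [3, 4]

list_a = [tablea1]
list_b = [tableb1]

# Before the loop, the list slot and the name refer to the same object.
same_before = list_a[0] is tablea1

for i in range(len(list_a)):
    # This builds a NEW object and stores it in the list slot.
    # The name tablea1 still refers to the old object.
    list_a[i] = list_a[i] + list_b[i]

same_after = list_a[0] is tablea1

print(same_before, same_after)  # True False
print(tablea1)                  # [1, 2]
print(list_a[0])                # [1, 2, 3, 4]
```

Assigning to list_a[i] changes what the list slot refers to. It never changes what the separate name tablea1 refers to.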

qqzj
  • I don't think this works in Python. You may have to force `tablea1 = list_a[0]` with a loop of some sort – zglin Sep 08 '16 at 22:20
  • This won't work for the same reason. Once I put tablea1, 2, 3 inside a list and loop over the list, the re-assignment would simply change the object that is referenced. The initial tables won't be touched. – qqzj Sep 08 '16 at 22:26
  • I meant to say that after all of your code runs, you would have to go through a separate exercise assigning each of the list_x[i] values back to the tableXn variables (not ideal, I know). – zglin Sep 08 '16 at 22:45
  • As long as it is another loop, the same problem comes back, right? – qqzj Sep 09 '16 at 00:07
  • What was the error you received? Please show traceback not just your interpretation of the error. – Parfait Sep 09 '16 at 02:18
  • http://stackoverflow.com/a/8989916/624829 – Zeugma Sep 09 '16 at 03:35
  • The code runs okay. No error. But the result is not what I wanted. I wanted to modify tables like tablea1, and that objective is not achieved. All the results I want are saved in list_a. – qqzj Sep 09 '16 at 04:25

2 Answers


@qqzj The problem you will run into is that Python doesn't really offer this feature. As @Boud mentions, the references to tablea1, tableb1, tablec1, etc. are lost after concatenation.

I'll illustrate a quick and dirty example of the workaround (which is very inefficient, but will get the job done).

Without your data, I'm basically creating random data frames.

import numpy as np
import pandas as pd

tablea1 = pd.DataFrame(np.random.randn(10, 4))
tableb1 = pd.DataFrame(np.random.randn(10, 4))
tablec1 = pd.DataFrame(np.random.randn(10, 4))

tablea2 = pd.DataFrame(np.random.randn(10, 4))
tableb2 = pd.DataFrame(np.random.randn(10, 4))
tablec2 = pd.DataFrame(np.random.randn(10, 4))

Applying your code to iterate over these lists:

list_a = [tablea1, tablea2]
list_b = [tableb1, tableb2]
list_c = [tablec1, tablec2]
for i in range(len(list_a)):
    list_a[i] = pd.concat([list_a[i], list_b[i], list_c[i]], axis=1)

If you run a comparison at this point, you see the issue you highlighted: list_a[0] now holds the concatenation of tablea1, tableb1, and tablec1, but that result has not been assigned back to tablea1.

As I mentioned in the comment, the answer is to assign list_a[0] back to tablea1:

tablea1=list_a[0]

You would repeat this for tablea2, tablea3, etc.

Doing the comparison, you can now see that tablea1 matches the values in list_a[0]:

tablea1==list_a[0]

      0     1     2     3     0     1     2     3     0     1     2     3
0  True  True  True  True  True  True  True  True  True  True  True  True
1  True  True  True  True  True  True  True  True  True  True  True  True
2  True  True  True  True  True  True  True  True  True  True  True  True
3  True  True  True  True  True  True  True  True  True  True  True  True
4  True  True  True  True  True  True  True  True  True  True  True  True
5  True  True  True  True  True  True  True  True  True  True  True  True
6  True  True  True  True  True  True  True  True  True  True  True  True
7  True  True  True  True  True  True  True  True  True  True  True  True
8  True  True  True  True  True  True  True  True  True  True  True  True
9  True  True  True  True  True  True  True  True  True  True  True  True

Again, this is not the ideal solution, but what you are looking for doesn't seem to be the 'pythonic' way.
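A commonly used alternative, sketched here rather than taken from the question, is to stop using one variable per table and keep the tables in dictionaries keyed by table number. The names tables_a, tables_b, and tables_c are hypothetical, and the random 10x4 frames are the same stand-in data as above. The dictionary entry then is the table, so updating the entry in a loop is all that is needed and nothing has to be assigned back by hand.

```python
import numpy as np
import pandas as pd

# Hypothetical layout: one dict per group, keyed by table number,
# instead of separate variables tablea1, tablea2, ...
table_numbers = (1, 2)
tables_a = {i: pd.DataFrame(np.random.randn(10, 4)) for i in table_numbers}
tables_b = {i: pd.DataFrame(np.random.randn(10, 4)) for i in table_numbers}
tables_c = {i: pd.DataFrame(np.random.randn(10, 4)) for i in table_numbers}

for i in table_numbers:
    # Replace the entry with the column-wise concatenation.
    tables_a[i] = pd.concat([tables_a[i], tables_b[i], tables_c[i]], axis=1)

print(tables_a[1].shape)  # (10, 12)
```

With 100 tables, only table_numbers changes. Every later use reads tables_a[i] instead of a separately named variable.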

zglin
  • Suppose I have 100 tables, tablea1, ..., tablea100. I want to batch-process these tables so that I do not have to write the concat call 100 times. The solution you propose essentially requires me to write tablea1 = list_a[0] 100 times, which totally defeats the purpose. In fact, I had found a workaround before: I can build the command as a string and run it with exec. But once my overall function contains a sub-function, this workaround fails. – qqzj Sep 10 '16 at 16:27

Thanks to zhqiat for the sample code. Let me expand on it a bit. The problem can be solved with the exec statement (Python 2):

import pandas as pd
import numpy as np

tablea1 = pd.DataFrame(np.random.randn(10, 4))
tableb1 = pd.DataFrame(np.random.randn(10, 4))
tablec1 = pd.DataFrame(np.random.randn(10, 4))

tablea2 = pd.DataFrame(np.random.randn(10, 4))
tableb2 = pd.DataFrame(np.random.randn(10, 4))
tablec2 = pd.DataFrame(np.random.randn(10, 4))

list_a = [tablea1, tablea2]
list_b = [tableb1, tableb2]
list_c = [tablec1, tablec2]

for i in range(1, len(list_a)+1):
    exec 'tablea{0} = pd.concat([tablea{0}, tableb{0}, tablec{0}], axis=1)'.format(i)

print tablea1

I used this approach for a while, but once the code got more complicated, exec started complaining:

SyntaxError: unqualified exec is not allowed in function 'function name' it contains a nested function with free variables

Here is the problematic code:

def overall_function():

    def dummy_function():
        return True

    tablea1 = pd.DataFrame(np.random.randn(10, 4))
    tableb1 = pd.DataFrame(np.random.randn(10, 4))
    tablec1 = pd.DataFrame(np.random.randn(10, 4))

    tablea2 = pd.DataFrame(np.random.randn(10, 4))
    tableb2 = pd.DataFrame(np.random.randn(10, 4))
    tablec2 = pd.DataFrame(np.random.randn(10, 4))

    list_a = ['tablea1', 'tablea2']
    list_b = ['tableb1', 'tableb2']
    list_c = ['tablec1', 'tablec2']

    for i, j, k in zip(list_a, list_b, list_c):
        exec(i + ' = pd.concat([' + i + ',' + j + ',' + k + '], axis=1)')


    print tablea1

overall_function()

This code produces the error message above. The odd thing is that my real function contains no other 'def' statement at all, so as far as I can tell it has no nested function, and I am puzzled why I get this error. Is there a way to make Python tell me which variable is the culprit, i.e. the free variable that causes the problem, or which sub-function is responsible for the failure? Ideally, for this example, I would like Python to tell me that dummy_function is the cause.
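One possible way to hunt for the culprit is the standard-library symtable module, which reports the scopes the compiler sees in a piece of source code. The sketch below is written in Python 3 syntax and is not a confirmed diagnosis of the real function. It rests on the assumption that lambdas and generator expressions also create nested function scopes even when no 'def' is written, which may explain the error in a function that seems to have no nested function. SOURCE, helper, and nested_scopes are hypothetical names used only for illustration.

```python
import symtable

# Hypothetical source to inspect: one inner def and one generator expression.
SOURCE = '''
def overall_function():
    def dummy_function():
        return helper
    squares = (n * n for n in range(3))
    return dummy_function, squares
'''


def nested_scopes(source):
    """List every scope nested inside a function, together with the names
    that scope resolves outside itself (free or global names)."""
    found = []

    def walk(table, inside_function):
        for child in table.get_children():
            if inside_function:
                names = sorted(
                    sym.get_name()
                    for sym in child.get_symbols()
                    if sym.is_free() or sym.is_global()
                )
                found.append((child.get_name(), names))
            walk(child, inside_function or child.get_type() == "function")

    walk(symtable.symtable(source, "<sketch>", "exec"), False)
    return found


for scope_name, names in nested_scopes(SOURCE):
    print(scope_name, names)
```

Running this on the text of the real function would list each nested scope by name, including ones that come from a lambda or a generator expression rather than from a 'def'.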

qqzj