
Let's say I have a script that reads data into a dataframe from a database, runs some logic on that dataframe, and then exports the resulting dataframe into another database table, like below. The issue is that the dataframe in transform.py isn't getting overwritten after the call to exec.

NOTE: This is a simple example to demonstrate the issue, not an actual problem I'm trying to solve using this approach.

Expectation:

Before exec

+---------+---------------+--------------+----------+
| metric  | modified_date | current_date | datediff |
+---------+---------------+--------------+----------+
| metric1 | 2019-03-31    | 2019-05-03   |       33 |
| metric2 | 2019-03-31    | 2019-05-03   |       33 |
| metric3 | 2019-03-31    | 2019-05-03   |       33 |
| metric4 | 2019-03-20    | 2019-05-03   |       44 |
+---------+---------------+--------------+----------+

After exec

+---------+---------------+--------------+----------+
| metric  | modified_date | current_date | datediff |
+---------+---------------+--------------+----------+
| metric4 | 2019-03-20    | 2019-05-03   |       44 |
+---------+---------------+--------------+----------+

Actual:

Before exec

+---------+---------------+--------------+----------+
| metric  | modified_date | current_date | datediff |
+---------+---------------+--------------+----------+
| metric1 | 2019-03-31    | 2019-05-03   |       33 |
| metric2 | 2019-03-31    | 2019-05-03   |       33 |
| metric3 | 2019-03-31    | 2019-05-03   |       33 |
| metric4 | 2019-03-20    | 2019-05-03   |       44 |
+---------+---------------+--------------+----------+

After exec

+---------+---------------+--------------+----------+
| metric  | modified_date | current_date | datediff |
+---------+---------------+--------------+----------+
| metric1 | 2019-03-31    | 2019-05-03   |       33 |
| metric2 | 2019-03-31    | 2019-05-03   |       33 |
| metric3 | 2019-03-31    | 2019-05-03   |       33 |
| metric4 | 2019-03-20    | 2019-05-03   |       44 |
+---------+---------------+--------------+----------+

They are the same!

transform.py

def dataframe_transform(logic, source_table, dest_table, database, existing_rows='truncate'):
    ...
    df = table_to_df(table=source_table, database=database)

    try:
        exec(logic)
    except Exception:
        raise

    result = df_to_table(dataframe=df, database=database, table=dest_table, existing_rows=existing_rows)

    return result

The logic filters the dataframe for records that need to be updated, kicks off another process, and overwrites the original dataframe with the filtered data.

logic.py

# This is just an example I made up - please don't focus on solving this.

late_df = pd.DataFrame()

# Check if data is late
late_cutoff = 40
for index, row in df.iterrows():
    if row['datediff'] >= late_cutoff:
        late_df = late_df.append(row, ignore_index=True)

... # Do something else

df = late_df # Save flagged records by updating the original dataframe.

Why would I do this? In this case, I know that the input is secure and it allows me to reuse this code for various scripts and separate out the transform logic.

Jonathan Porter
  • What prevents `logic.py` from being a stand-alone function? I do not think this is an appropriate use of `exec` at all. – modesitt May 03 '19 at 17:29
  • You still should not be doing that to begin with. [Appending to a DataFrame within a loop is very bad](https://stackoverflow.com/a/37009561/4333359) and you [should not iterate over the rows](https://stackoverflow.com/a/55557758/4333359). – ALollz May 03 '19 at 17:31
  • @ALollz Especially when this can be done in a single line with a mask or a variety of other ways: `df.loc[df['datediff'] >= late_cutoff]`. – modesitt May 03 '19 at 17:32
  • Again, it isn't about the example - I just made it up. The point is I'm wondering why the df doesn't get updated after the exec. – Jonathan Porter May 03 '19 at 17:37
  • I think @harvey is right and it is going to be a scope issue, but it's hard to help here b/c this isn't an MCVE: https://stackoverflow.com/help/mcve Please show us a complete example with sample data that can be copied and pasted and run. – JohnE May 06 '19 at 20:19

4 Answers


Check your scope. It's impossible to tell from the code provided, but I suspect your exec call is not managing scope (locals, globals) correctly. "In Python 3, exec is a function; its use has no effect on the compiled bytecode of the function where it is used." (from What's the difference between eval, exec, and compile?)
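A minimal sketch of the scope issue (made-up names, plain strings standing in for dataframes): inside a function, exec() writes to a throwaway copy of the locals, while an explicit namespace dict does receive the write.

```python
def transform():
    df = "original"
    exec("df = 'changed'")  # rebinds df only in a snapshot of locals()
    return df               # the real local df is untouched

ns = {"df": "original"}
exec("df = 'changed'", ns)  # with an explicit namespace, the write lands in ns

print(transform())  # original
print(ns["df"])     # changed
```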

see also: https://www.programiz.com/python-programming/methods/built-in/exec

Personal Opinion: eval/exec are evil and to be avoided.

One last point that has been made in the comments by others: the code example shows that you are still thinking in rows, mixing vectors (df['col']) with scalars (late_cutoff) in row-based operations (for x in df.iterrows()). This is a common issue for pandas users, and I do a lot of refactoring on this type of problem for others. Your programs will be an order of magnitude faster if you change your code to use pandas the way it is designed to work: without loops, and without changing the original data. Read once, create a new dataframe with the altered data without using iterators, write once. If you must loop, build a set of keys and iterate over the set, applying a vectorized operation per key:

keys = set(df['key_col'])
for key in keys:
    # one vectorized selection per key instead of row-by-row work
    dfx = df[df['key_col'] == key]
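Applied to the question's own example, the whole iterrows/append loop collapses to one boolean mask (a sketch using the column names and sample values from the question):

```python
import pandas as pd

df = pd.DataFrame({
    "metric": ["metric1", "metric2", "metric3", "metric4"],
    "datediff": [33, 33, 33, 44],
})

late_cutoff = 40
# One vectorized comparison replaces the iterrows/append loop.
late_df = df[df["datediff"] >= late_cutoff]
print(late_df["metric"].tolist())  # ['metric4']
```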

This too may be useful to you (see the write-many logic for improving write speed): Bulk Insert A Pandas DataFrame Using SQLAlchemy

Harvey
  • I think you are right, this is a scope issue. Note that `exec` takes locals and globals as parameters -- it's in the link you provide but you might want to make that explicit in your answer. – JohnE May 06 '19 at 20:20

This is a mixed Python-and-database question, so I think you need to check both the Python side and the database side.

I assume you are using MySQL (the tables in the question look like mysql client output). The expected flow is: read source_table from the database into a dataframe, change the dataframe with logic.py, and write it to dest_table, so the target table's before/after should differ.

My questions:

  1. The source table should be the same before and after. Is it possible the check is wrong and you compared the before/after of the source table? As mentioned, that one should not change.

  2. The target table should be different before and after. If it isn't, how can you be sure the df_to_table step completed as you expected? Is it possible a wrong target table name was passed as a parameter, so a different table was changed while you checked the intended one?

  3. Or the target table was not changed and the returned result code did not show it.

To track down the issue, I think trace points or logging are effective and efficient tools
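For instance, a trimmed-down version of dataframe_transform with a couple of log lines (a sketch; the function name is hypothetical and a plain list stands in for the dataframe) would show immediately that exec() never replaced df:

```python
import logging

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("transform")

def dataframe_transform_debug(logic, df):
    # Hypothetical trimmed-down dataframe_transform, instrumented
    # to show whether exec() actually rebound the local df.
    rows_before = len(df)
    exec(logic)
    log.info("rows before exec: %d, after exec: %d", rows_before, len(df))
    return df

result = dataframe_transform_debug("df = df[:1]", [1, 2, 3])
# logs: rows before exec: 3, after exec: 3 -> df was never rebound
```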

Good luck

Yong Wang

As everyone has already mentioned, you should prefer to simply import your code, but I assume you are dynamically building a "string of code" somehow, so that's not possible for you.

Are you sure that exec(logic) can rebind df here at all? In Python 3, exec() inside a function writes to a copy of the function's locals, so an assignment like df = late_df in the executed code never updates the real local df - whether you call exec(logic) or exec(logic, globals(), locals()). The reliable pattern is to pass an explicit namespace dict to exec and read the result back out of it:

logic = """
late_df = df.replace('a', 'x')
df = late_df
"""


def simple_exec_transform(df):
    df = df + 'opq'
    try:
        exec(logic)  # writes to a snapshot of locals(), not the real local df
    except Exception:
        raise

    return df


def bad_transform(df):
    df = df + 'opq'
    try:
        exec(logic, globals(), locals())  # same problem: locals() is a copy
    except Exception:
        raise

    return df


def run_logic(locals_dict, return_variable='df'):
    exec(logic, globals(), locals_dict)
    if return_variable not in locals_dict:
        raise NameError("return_variable is not available in locals after exec(logic)")
    return locals_dict[return_variable]


def controlled_locals_exec_transform(df):
    df = df + 'opq'
    try:
        df = run_logic({'df': df})
    except Exception:
        raise
    return df


print(simple_exec_transform('abcdef'))
print(bad_transform('abcdef'))
print(controlled_locals_exec_transform('abcdef'))
# abcdefopq  (local df not updated)
# abcdefopq  (local df not updated)
# xbcdefopq  (df read back from the explicit namespace)

imposeren
  • If your `logic` starts a subprocess and performs `df = late_df` in the subprocess, then it also will not change anything. – imposeren May 13 '19 at 11:40

I know it's tempting to take code fragments as input since you have an interpreter handy, but you could instead import different logic modules at runtime depending on your need:

if predicate1:
    import logic_one as logic
elif predicate2:
    import logic_two as logic

logic.my_logic_operations_on_dataframe(df)
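If the choice of module is only known at runtime (e.g. it comes from a config value), importlib does the same thing without the if-chain; load_logic is a made-up helper, and the function name is the one invented in this answer:

```python
import importlib

def load_logic(module_name):
    # module_name might come from config, e.g. "logic_one" or "logic_two"
    return importlib.import_module(module_name)

# logic = load_logic(module_name)
# logic.my_logic_operations_on_dataframe(df)
```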
Mike Robins