Determine all files read into and written from an ipython notebook

Question

This is a generalization of this question: Way to extract pickles coming in and out of ipython / jupyter notebook

At the highest level, I'm looking for a way to automatically summarize what goes on in an ipython notebook. One way I see of simplifying the problem is to treat all the data manipulations that go on inside the notebook as a black box and focus only on its inputs and outputs. So, given the filepath to an ipython notebook, is there a way to easily determine all the files/websites it reads into memory, and also all the files it later writes/dumps? I'm thinking maybe there could be a function that scans the file, parses it for inputs and outputs, and saves them into a dictionary for easy access:

summary_dict = summerize_file_io(ipynb_filepath)

print summary_dict["inputs"] 
> ["../Resources/Data/company_orders.csv", "http://special_company.com/company_financials.csv" ]

print summary_dict["outputs"]
> ["orders_histogram.jpg","data_consolidated.pickle"]

I'm wondering how to do this easily beyond just pickle objects, covering other formats such as txt, csv, jpg, png, etc., and also cases where data is read directly from the web into the notebook itself.

You can pickle entire python interactive session using dill: http://trac.mystic.cacr.caltech.edu/project/pathos/wiki/dill.html — denfromufa, Mar 14 '17 at 18:15
@denfromufa Perhaps I'm missing something ... How would that help explain what's happening in the notebook? — Afflatus, Mar 14 '17 at 20:48
You probably won't be able to do this without actually executing the notebook. You can replace __builtin__.open as in http://stackoverflow.com/a/2023709/464289, and if you use the same function to download files, you can just replace that one with a call that logs what you want, and then downloads the file. — JRG, Mar 15 '17 at 16:08
@CharlieG - On pickling: http://stackoverflow.com/questions/7501947/understanding-pickling-in-python — Afflatus, Mar 18 '17 at 13:30

score 4 · Accepted Answer · edited May 23 '17 at 10:29

You can check which files you have opened or modified by patching the builtin open, as JRG suggested. If you want to track website access as well, extend the same approach by patching whatever functions you use to connect to websites.

import builtins


modified = {}
old_open = builtins.open


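def new_open(name, mode='r', *args, **kwargs):
    # record the most recent mode used for this file
    modified[name] = mode
    # pass mode positionally so extra positional arguments (e.g. buffering) still line up
    return old_open(name, mode, *args, **kwargs)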


# patch builtin open
builtins.open = new_open


# check modified
def whats_modified():
    print('Session has opened/modified the following files:')
    for name in sorted(modified):
        mode = modified[name]
        print(mode.ljust(8) + name)
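The same idea can be carried over to web access. The following is only a sketch, assuming the notebook fetches data through urllib.request.urlopen; the names fetched, old_urlopen and new_urlopen are made up for illustration. Code that uses another client (requests, for example), or that ran "from urllib.request import urlopen" before the patch was installed, would not be seen and would need its own wrapper.

```python
import urllib.request

fetched = []                          # URLs requested during the session
old_urlopen = urllib.request.urlopen


def new_urlopen(url, *args, **kwargs):
    # urlopen accepts either a plain string or a Request object
    if isinstance(url, urllib.request.Request):
        target = url.full_url
    else:
        target = url
    fetched.append(target)
    return old_urlopen(url, *args, **kwargs)


# patch urlopen on the module, so lookups of urllib.request.urlopen hit the wrapper
urllib.request.urlopen = new_urlopen
```

After the notebook has run, the fetched list holds the requested URLs in the order they were opened.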

If we execute the open patch in the interpreter (or use it as a module), we can see which files we've opened and which mode each was opened with.

In [4]: with open('ex.txt') as file:
   ...:     print('ex.txt:', file.read())
   ...:     
ex.txt: some text.



In [5]: with open('other.txt', 'w') as file:
   ...:     file.write('Other text.\n')
   ...:     

In [6]: whats_modified()
Session has opened/modified the following files:
r       ex.txt
w       other.txt

This is somewhat limited, though: the recorded mode is overwritten each time a file is reopened. That can be fixed with some extra checks in new_open.
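One possible form of those extra checks, as a rough sketch rather than a definitive fix: keep a set of modes per file instead of a single value, then fold the result into the inputs/outputs dictionary the question asks for. The names opened, original_open, tracking_open and summarize_file_io are hypothetical. Note that only calls that go through builtins.open are recorded; code that calls io.open or os.open directly (pathlib.Path.open, for instance) or opens files from a C extension bypasses the patch.

```python
import builtins
from collections import defaultdict

opened = defaultdict(set)             # file name -> every mode it was opened with
original_open = builtins.open


def tracking_open(name, mode='r', *args, **kwargs):
    # add to the set instead of overwriting, so a reopen keeps earlier modes
    opened[name].add(mode)
    return original_open(name, mode, *args, **kwargs)


builtins.open = tracking_open


def summarize_file_io():
    """Split the recorded files into inputs (read) and outputs (written)."""
    summary = {'inputs': [], 'outputs': []}
    # key=str keeps sorting safe if names mix strings and Path objects
    for name in sorted(opened, key=str):
        modes = ''.join(opened[name])
        if 'r' in modes or '+' in modes:
            summary['inputs'].append(name)
        if any(flag in modes for flag in 'wax+'):
            summary['outputs'].append(name)
    return summary
```

A file that is written and later read back shows up in both lists, which is usually what you want when treating the notebook as a black box.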

Is there a way to do this at a more global level? Let's say you use this in a Python script that calls another script, which opens a file. — BND, Nov 18 '22 at 11:17
