
I have a text file which contains lines of !python commands with 2 arguments. For example:

!python example.py 12345.mp4 12345.pkl
!python example.py 111.mp4 111.pkl
!python example.py 123.mp4 123.pkl
!python example.py 44441.mp4 44441.pkl
!python example.py 333.mp4 333.pkl
...

The thing is, I want to run all those lines in a notebook environment (Microsoft Azure ML Notebook, or Google Colab). When I copy-paste, only a few lines (around 500) are allowed to be pasted into a notebook code block, and I have tens of thousands of lines. Is there a way to achieve this?

I have thought of using a for loop to reproduce the text file, but as far as I know I can't run !python commands inside a Python for loop.

Edit: I should add that these mp4 files are in the same folder as the Python code and the text file containing those lines. So I want to run example.py for all files in a single folder, with a second argument that changes the .mp4 extension to .pkl (because that acts as the name of the output of the command). Maybe now a better solution which runs faster can be made. My example.py file can be found here: https://github.com/open-mmlab/mmaction2/blob/90fc8440961987b7fe3ee99109e2c633c4e30158/tools/data/skeleton/ntu_pose_extraction.py

  • That seems like a crazy thing to want. Is this an [XY problem](https://en.wikipedia.org/wiki/XY_problem)? – tripleee Dec 09 '22 at 18:42
  • `!sh -c 'for i in 12345 111 123; do python example.py "$i.mp4" "$i.pkl"; done'` but that still gets crazy if you have thousands of values. – tripleee Dec 09 '22 at 18:44
  • I have added a few more information about my problem. – taliegemen Dec 09 '22 at 20:15
  • `!sh -c 'for f in *.mp4; do python example.py "$f" "${f%.mp4}.pkl"; done'` but why do you want to run it in a notebook? Just put that command in a file and run it, or type it at the terminal prompt in the directory where the files are if you have access to that. (Then no need for the `sh -c '...'` wrapper; maybe it's not necessary here, either.) – tripleee Dec 10 '22 at 05:55
  • Changing the `;` before `done` to `&` will run all the processes in parallel; but running 10,000+ processes at once will seriously clog your system. Maybe look at GNU `parallel` for controlled parallelism; and generally, get acquainted with the shell, which is what you want instead of Python here, or especially a notebook for anything noninteractive. – tripleee Dec 10 '22 at 05:59

2 Answers


While running thousands of Python interpreters seems like a really bad design, as each interpreter needs a non-negligible amount of time to start, you can just remove the exclamation mark and run each line using os.system:

import os
with open("file.txt",'r') as f:
    for line in f:
        command = line.strip()[1:]  # strip() removes the trailing newline; [1:] removes the leading !
        os.system(command)

This will take a few months to finish if you are starting tens of thousands of Python interpreters; you are much better off running all the code inside a single interpreter, in multiple processes via multiprocessing, if you know what the file does.
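As a sketch of that idea, the file list can drive a small worker pool so that only a couple of interpreters are alive at once (assumptions: every .mp4 in the current folder should be processed, and 2 workers is a reasonable limit; tune it for your machine):

```python
import subprocess
from multiprocessing.pool import ThreadPool
from pathlib import Path

def run_one(video: Path) -> int:
    # Launch one example.py run; the .pkl name mirrors the .mp4 name.
    result = subprocess.run(
        ["python", "example.py", str(video), str(video.with_suffix(".pkl"))])
    return result.returncode

# Two threads => at most two child interpreters running at any moment,
# instead of tens of thousands at once.
with ThreadPool(2) as pool:
    codes = pool.map(run_one, sorted(Path(".").glob("*.mp4")))
print(f"{codes.count(0)} of {len(codes)} runs succeeded")
```

Threads are enough here because the workers spend their time waiting on child processes, not computing in Python.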

Ahmed AEK
  • Thank you! I will try this asap and let you know if it works. – taliegemen Dec 09 '22 at 20:49
  • Better still, use `subprocess`. Better still, don't run Python as a subprocess of itself; `import` the code you want to run (might require some [refactoring](https://stackoverflow.com/a/69778466/874188)) and perhaps run it in parallel using `multiprocessing` or threading. – tripleee Dec 10 '22 at 05:51
  • "Months" is probably overly pessimistic; I can run a few hundred Python scripts per second on my laptop, but then on top of that it depends on what the Python script does, obviously. The main bottleneck is probably file I/O, especially if you need to process the contents of many potentially large video files (and then adding parallelism will only clog the I/O channel, and not improve anything). – tripleee Dec 10 '22 at 06:02
  • @tripleee the "months" part is clearly sarcastic, but if opening one interpreter takes 3 seconds (which is about how long it takes to load numpy and opencv) you're going to wait at least 8 extra hours. Also, this is clearly embarrassingly parallel, so you can get a speedup of at least two just by interleaving disk access and computation; beyond that, the running time can only be marginally reduced. – Ahmed AEK Dec 10 '22 at 06:15
  • Yeah, running at least two instances in parallel probably makes sense if the processing overhead is nontrivial. – tripleee Dec 10 '22 at 06:37
  • Since the code uses cuda computing from pytorch libraries, I don't think they can be run parallel without non-trivial changes to the code. – taliegemen Dec 10 '22 at 16:17
  • @taliegemen they can run in parallel as long as you don't fill the entire GPU memory, the GPU does the same application context switching as your CPU does. – Ahmed AEK Dec 10 '22 at 16:34
  • Ultimately, without knowledge about your payloads, all we can provide is general advice. If performance optimizations are important, measure how much you can parallelize. If one payload fills up your GPU and/or CPU and/or I/O bandwidth completely, you basically can't. – tripleee Dec 10 '22 at 17:24
  • Thank you for the answer! I used the first one with subprocess since it does not require me to generate my txt file before running my commands. But this also satisfied what I was looking for, so thank you again for your kind answer! – taliegemen Dec 22 '22 at 18:09

What you are asking seems misdirected. Running the commands specifically in a notebook only makes sense if each command produces some output which you want to display in the notebook; and even then, if there are more than a few, you want to automate things.

Either way, a simple shell script will easily loop over all the files.

#!/bin/sh
for f in *.mp4; do
    python example.py "$f" "${f%.mp4}.pkl"
done

If you really insist on running the above from a notebook, save it in a file (say, allmp4) and run chmod +x on that file; then you can run it with ! at any time (simply ! ./allmp4).

(The above instructions are OS-dependent; if you are running your notebook on Windows, the commands will be different, and sometimes bewildering to the point where you probably want to remove Windows.)

Equivalently, anything you can put in a script can be run interactively; depending on the exact notebook, you might not have access to a full shell in ! commands, in which case you can get one with sh -c '... your commands ...'. In general, newlines can be replaced with semicolons in shell scripts, though there are a few contexts where newlines translate to just whitespace (like after then and do).

Quite similarly, you can run python -c '... your python code ...' though complex Python code is hard to serialize into a one-liner. But then, your notebook already runs Python, so you can just put your loop in a cell, and run that.

from pathlib import Path
import subprocess

for f in Path(".").glob("*.mp4"):
    subprocess.run(
        ["python", "example.py",
         str(f), str(f.with_suffix(".pkl"))],
        check=True, text=True)

... though running Python as a subprocess of itself is often inefficient and clumsy; if you can import example and run its main function directly, you have more control (in particular around error handling), and more opportunities to use multiprocessing or other facilities for parallel processing etc. If this requires you to refactor example.py somewhat, perhaps weigh reusability against immediate utility - if the task is a one-off, getting it done quickly might be more important than getting it done right.

tripleee
  • Each of the codes basically makes pose extraction with running 2 deep learning architectures one after another, which needs lots of gpu power and vram which my computer does not have. Also, I have a azure student credits and a well built compute unit on machine learning service so migrating to something without notebook will take time. – taliegemen Dec 10 '22 at 16:24
  • Then don't parallelize. The bulk of this still holds. The kernel which runs your `!` commands is the same kernel which runs your notebook cells so the `!` detour seems like an unnecessary diversion. – tripleee Dec 10 '22 at 17:27
  • Ok, I will try the first answer and second answer and let you know how that went. Since I am in a data gathering process right now it will take some time for me to try those (a week or so). Also, yeah I don't believe these will run in parallel since I believe one command uses minimum 4 gigs of vram. – taliegemen Dec 11 '22 at 19:09
  • I have managed to test this code, and it runs flawlessly! But since my code might return errors on some occasions, I changed check=True to check=False. – taliegemen Dec 22 '22 at 18:04