Regex to capture all import statements

Question

I want to create a script that looks inside a Python file and finds all import statements. Possible variations of those are the following:

import os
import numpy as np
from itertools import accumulate
from collections import Counter as C
from pandas import *

By looking at these, one could argue that the logic should be:

Get me all `<foo>` from `from <foo>` statements and those `<bar>` from `import <bar>` statements that are not preceded by `from <foo>`.

To translate the above into a regex, I wrote:

from (\w+)|(?<!from \w+)import (\w+)

The problem seems to be the variable width of the negative lookbehind (Python's re module only accepts fixed-width lookbehinds), but I have not been able to fix it.
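For reference, here is a minimal sketch of how the failure shows up when the pattern is compiled with the standard re module. The helper name compiles is only illustrative, and the exact wording of the error message may differ between Python versions.

```python
import re

# The pattern from the question; the lookbehind contains \w+, which has no fixed width.
pattern = r'from (\w+)|(?<!from \w+)import (\w+)'

def compiles(regex):
    """Return True if `regex` compiles with the re module, False otherwise."""
    try:
        re.compile(regex)
        return True
    except re.error as exc:
        # re rejects lookbehinds whose width it cannot determine in advance.
        print(f"re.error: {exc}")
        return False

print(compiles(pattern))
```

The pattern is rejected at compile time, before any matching happens, so no input text can make this pattern work as written.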

EDIT:

As a bonus, it would also be nice to capture multiple imports on one line, as in:

import sys, glob

Do you mean you want to only get them at the start of a line? `^(?:from\s+(\w+)|import\s+(\w+))`? Or even `^(?:from|import)\s+(\w+)` (not sure you want to differentiate between the two types)? — Wiktor Stribiżew, Dec 05 '18 at 12:27
@WiktorStribiżew Not necessarily. They might be inside functions or classes or if statements, etc. and thus indented. — Ma0, Dec 05 '18 at 12:27
Ok, but that also means you may have `s="""this is a present from me"""` and it will also get matched. Moreover, there are `import re, csv, pandas` like imports. — Wiktor Stribiżew, Dec 05 '18 at 12:28
@WiktorStribiżew [Why would that get matched?](https://regex101.com/r/dadbF2/1). Oh yeah, I need that too. Thanks! — Ma0, Dec 05 '18 at 12:29
I think your problem is related to https://stackoverflow.com/questions/44988487/regex-to-parse-import-statements-in-python — Vaibhav gusain, Dec 05 '18 at 12:30
[Because you do not want to only match at the start](https://regex101.com/r/dadbF2/2). — Wiktor Stribiżew, Dec 05 '18 at 12:30
@WiktorStribiżew I see. We could easily modify it to look only for whitespace before though, right? — Ma0, Dec 05 '18 at 12:32
Do you mean `^\s*(?:from|import)\s+(\w+)` ([demo](https://regex101.com/r/QVt56S/1))? — Wiktor Stribiżew, Dec 05 '18 at 12:45
@WiktorStribiżew Exactly. I would accept that as an answer if you feel comfortable enough posting it. Thanks btw! — Ma0, Dec 05 '18 at 12:52
But what about `import re, csv, pandas`? Maybe `^\s*(?:from|import)\s+(\w+(?:\s*,\s*\w+)*)` will be better? See [this demo](https://regex101.com/r/QVt56S/2). You will need to post-process the match then, though. — Wiktor Stribiżew, Dec 05 '18 at 12:56

score 3 · Accepted Answer · answered Dec 05 '18 at 15:05

It seems you only want to extract the matches from the start of a line, taking into account the leading whitespace.

You may consider using

^\s*(?:from|import)\s+(\w+(?:\s*,\s*\w+)*)

See the regex demo.

Details

^ - start of string (use re.M to also match start of a line)
\s* - 0+ whitespaces (use [^\S\r\n]* to only match horizontal whitespace)
(?:from|import) - either of the two words
\s+ - 1+ whitespaces
(\w+(?:\s*,\s*\w+)*) - 1 or more word chars, followed by 0+ occurrences of: 0+ whitespaces, a comma, 0+ whitespaces, and then 1+ word chars.

In Python, you may later split the Group 1 value with re.split(r'\s*,\s*', group_1_value) to get individual comma-separated module names.
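Putting the pieces together, a minimal sketch might look like the following. The function name find_imported_modules and the sample text are assumptions made for illustration; the pattern and the split call are the ones given above.

```python
import re

# Pattern from the answer; re.M makes ^ match at the start of every line.
IMPORT_RE = re.compile(r'^\s*(?:from|import)\s+(\w+(?:\s*,\s*\w+)*)', re.M)

def find_imported_modules(source):
    """Return the module names captured by the regex, in order of appearance."""
    modules = []
    for group_1_value in IMPORT_RE.findall(source):
        # Group 1 may hold several comma-separated names, e.g. "sys, glob".
        modules.extend(re.split(r'\s*,\s*', group_1_value))
    return modules

# Hypothetical sample covering the variations listed in the question.
sample = '''import os
import numpy as np
from itertools import accumulate
from collections import Counter as C
from pandas import *
def f():
    import sys, glob
'''

print(find_imported_modules(sample))
# ['os', 'numpy', 'itertools', 'collections', 'pandas', 'sys', 'glob']
```

Note that \w does not match a dot, so a line such as from os.path import join yields only os. An as alias inside a comma-separated list (import a, b as c, d) also ends the capture at that point.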
