
I want to create a script that looks inside a Python file and finds all import statements. Possible variations of those are the following:

import os
import numpy as np
from itertools import accumulate
from collections import Counter as C
from pandas import *

By looking at these, one could argue that the logic should be:

Get me all `<foo>` from `from <foo>` statements, and those `<bar>` from `import <bar>` that are not preceded by `from <foo>`.

To translate the above in regex, I wrote:

from (\w+)|(?<!from \w+)import (\w+)

The problem seems to be the non-fixed width of the negative lookbehind (Python's `re` module only accepts lookbehind patterns that match a fixed number of characters), but I have not been able to fix it.
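For reference, a minimal sketch that reproduces the failure with the standard `re` module. The names `PATTERN` and `compile_error` are only illustrative:

```python
import re

# The pattern from the question; \w+ inside the lookbehind has variable width.
PATTERN = r"from (\w+)|(?<!from \w+)import (\w+)"

try:
    re.compile(PATTERN)
    compile_error = None
except re.error as exc:
    # The standard re module rejects variable-width lookbehinds at compile time.
    compile_error = str(exc)

print(compile_error)
```

The pattern is rejected when it is compiled, before any text is searched.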

EDIT:

As a bonus, it would also be nice to capture multiple imports on one line, as in:

import sys, glob
Ma0
  • Do you mean you want to only get them at the start of a line? `^(?:from\s+(\w+)|import\s+(\w+))`? Or even `^(?:from|import)\s+(\w+)` (not sure you want to differentiate between the two types)? – Wiktor Stribiżew Dec 05 '18 at 12:27
  • @WiktorStribiżew Not necessarily. They might be inside functions or classes or if statements, etc. and thus indented. – Ma0 Dec 05 '18 at 12:27
  • Ok, but that also means you may have `s="""this is a present from me"""` and it will also get matched. Moreover, there are `import re, csv, pandas` like imports. – Wiktor Stribiżew Dec 05 '18 at 12:28
  • @WiktorStribiżew [Why would that get matched?](https://regex101.com/r/dadbF2/1). Oh yeah, I need that too. Thanks! – Ma0 Dec 05 '18 at 12:29
  • I think your problem is related to https://stackoverflow.com/questions/44988487/regex-to-parse-import-statements-in-python – Vaibhav gusain Dec 05 '18 at 12:30
  • [Because you do not want to only match at the start](https://regex101.com/r/dadbF2/2). – Wiktor Stribiżew Dec 05 '18 at 12:30
  • @WiktorStribiżew I see. We could easily modify it to look only for whitespace before though, right? – Ma0 Dec 05 '18 at 12:32
  • Do you mean `^\s*(?:from|import)\s+(\w+)` ([demo](https://regex101.com/r/QVt56S/1))? – Wiktor Stribiżew Dec 05 '18 at 12:45
  • @WiktorStribiżew Exactly. I would accept that as an answer if you feel comfortable enough posting it. Thanks btw! – Ma0 Dec 05 '18 at 12:52
  • But what about `import re, csv, pandas`? Maybe `^\s*(?:from|import)\s+(\w+(?:\s*,\s*\w+)*)` will be better? See [this demo](https://regex101.com/r/QVt56S/2). You will need to post-process the match then, though. – Wiktor Stribiżew Dec 05 '18 at 12:56
  • @WiktorStribiżew it keeps getting better.. – Ma0 Dec 05 '18 at 13:26

1 Answer

It seems you only want to extract the matches from the start of a line, taking into account the leading whitespace.

You may consider using

^\s*(?:from|import)\s+(\w+(?:\s*,\s*\w+)*)

See the regex demo.

Details

  • `^` - start of string (use `re.M` to also match at the start of each line)
  • `\s*` - 0+ whitespaces (use `[^\S\r\n]*` to only match horizontal whitespace)
  • `(?:from|import)` - either of the two words
  • `\s+` - 1+ whitespaces
  • `(\w+(?:\s*,\s*\w+)*)` - Group 1: 1 or more word chars, followed by 0+ occurrences of 0+ whitespaces, a comma, 0+ whitespaces and then 1+ word chars.

In Python, you may later split the Group 1 value with re.split(r'\s*,\s*', group_1_value) to get individual comma-separated module names.
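Putting the pieces together, here is a minimal sketch of how the pattern and the post-processing step might be combined. The names `IMPORT_RE`, `find_imported_modules` and `sample` are made up for illustration, and the sample text is just the import lines from the question plus one indented import:

```python
import re

# The pattern from above, compiled with re.M so that ^ matches at every line start.
IMPORT_RE = re.compile(r"^\s*(?:from|import)\s+(\w+(?:\s*,\s*\w+)*)", re.M)


def find_imported_modules(source):
    """Return the module names found in import statements, in order of appearance."""
    modules = []
    for match in IMPORT_RE.finditer(source):
        # Group 1 may hold several comma-separated names, e.g. "sys, glob".
        modules.extend(re.split(r"\s*,\s*", match.group(1)))
    return modules


sample = """\
import os
import numpy as np
from itertools import accumulate
from collections import Counter as C
from pandas import *
import sys, glob

def f():
    import json
"""

print(find_imported_modules(sample))
# ['os', 'numpy', 'itertools', 'collections', 'pandas', 'sys', 'glob', 'json']
```

Note that `\w` does not match a dot, so a dotted name such as `os.path` would only be captured up to the first dot with this pattern.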

Wiktor Stribiżew