I'm wondering how to

  1. determine (True / False) whether a line of Python code has a comment
  2. split the line into code, comment

Something such as:

loc_1 = "print('hello') # this is a comment"

Is very straightforward, but something such as:

loc_2 = 'for char in "(*#& eht # ": # pylint: disable=one,two # something'

Is not so straightforward. I don't know how to go about this in general, such that I can do

f(loc_2)
# returns
[
    # code
    'for char in "(*#& eht # ":',
    # comment
    ' # pylint: disable=one,two # something'
]
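
To make the difficulty concrete, here is a minimal sketch of a naive split on the first `#`. The helper name `naive_split` is hypothetical and only for illustration; it cuts the string literal in half because it cannot tell a `#` inside a string from the start of a comment:

```python
def naive_split(line):
    """Split at the first '#', ignoring whether it sits inside a string literal."""
    code, hash_mark, rest = line.partition("#")
    return [code, hash_mark + rest]

loc_2 = 'for char in "(*#& eht # ": # pylint: disable=one,two # something'

# The first element ends in the middle of the string literal,
# which is not the split asked for above.
print(naive_split(loc_2))
```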

From comments: "You tagged this with libcst. Are you already using that library to give you an AST?"

I have tried to use this and failed with it, e.g.:

From comments: "Are you parsing just single lines, the source code for a function or a class, or parsing whole modules?"

I'm parsing single lines - that was my intention at least. I wanted to be able to iterate over the lines in a file. I have a pre-existing process which already iterates over the lines of a Python file, but I wanted to extend it to consider the comments, which motivated this question.

From comments: "What is your eventual goal with the parse results, just get the source code without comments?"

No - I want the source code and the comment as given in the example.

Peter Mortensen
baxx
  • `loc_1 = "print('hello') # this is a comment` is not valid Python. There is a `"` too many or missing. – Martijn Pieters Aug 21 '21 at 18:19
  • You tagged this with `libcst`, are you already using that library to give you an AST? – Martijn Pieters Aug 21 '21 at 18:20
  • And finally: are you parsing just single lines, the source code for a function or a class, or parsing whole modules? What is your eventual goal with the parse results, just get the source code without comments? – Martijn Pieters Aug 21 '21 at 18:23
  • @MartijnPieters I've updated the post with answers to these, thanks – baxx Aug 21 '21 at 18:40
  • note i've never touched libcst before this, but it seems like that might be a solution, hence the tag – baxx Aug 21 '21 at 18:41
  • Perhaps if you gave a try at defining `f()` then others would help out as well. Check out `tokenize.generate_tokens()`. If this question was accepting answers, I would give you one. – JonSG Aug 21 '21 at 19:01
  • i still don't get what this would be good for you mknow that a # should trigger a new line and to fond your code add a single character besides that identifies code or four sacesp or what ever – nbk Aug 21 '21 at 19:02
  • this question is [discussed at meta](https://meta.stackoverflow.com/q/410925/839601) – gnat Aug 21 '21 at 20:04
  • @nbk I'm not sure what your comment means. – baxx Aug 21 '21 at 20:52
  • @baxx what didn't you understand your idea makes no sense to me why want you do that, python parses the code, so there is no need to split it, the only thing wozld be to display source code. In order to that it is simper to construct the lines so that coment and code can be distinguised by a identifer like # for comnts – nbk Aug 21 '21 at 20:58
  • @nbk there seems to be a lot of misspellings in your comments that a spell-checker would pick up, I'm struggling to understand the meanings of your comments. – baxx Aug 21 '21 at 21:04
  • You can’t do this if you are handling files line by line. How would you handle `some_string = """first` (newline) `second # not a comment` (newline) `last"""`. The basic unit in Python is a *statement*, some of which can contain other statements (such as functions and classes, but also loops and if and try and with statements). Any given line of text could contain 0, 1 or multiple complete statements or be part of a larger statement. You *have* to parse the text in a way that takes that into account. Usually, a module (file) at a time. – Martijn Pieters Aug 21 '21 at 21:35
  • I can’t help here because there is not enough information. You can try working with the cst library and see where you get stuck; start with the parsing functions. – Martijn Pieters Aug 21 '21 at 21:36
  • @MartijnPieters not enough information in what sense? I tried the parsing functions and it didn't work - the lines are parsed one at a time - so there wouldn't ever be new-lines in them. I haven't worked with parsers and what not enough to understand what is necessary, though I did feel that this might perhaps be impossible. – baxx Aug 21 '21 at 22:10

1 Answer

I think you can get a good way down the path to an answer using io.StringIO() and tokenize.generate_tokens(). Applied to a string holding one line of Python, generate_tokens() yields a sequence of TokenInfo tuples that you can inspect for a comment token (one whose type is tokenize.COMMENT).
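
To see what the tokenizer produces for one line, here is a minimal sketch. The helper name `list_tokens` is my own, not part of the standard library; `tokenize.tok_name` maps numeric token types to readable names:

```python
import io
import tokenize

def list_tokens(line):
    """Return (token type name, token text) pairs for one line of source."""
    reader = io.StringIO(line).readline
    return [
        (tokenize.tok_name[tok.type], tok.string)
        for tok in tokenize.generate_tokens(reader)
    ]

# Print each token's type and text; the comment shows up as a COMMENT token.
for name, text in list_tokens("print('hello') # this is a comment"):
    print(f"{name:<10} {text!r}")
```

A `#` inside a string literal is part of a STRING token, so it never appears as a COMMENT.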

Your f() function might look a bit like this, but please give it a more meaningful name than f():

import io
import tokenize

def separate_code_from_comments(text):
    """Split a single line of Python source into [code, comment]."""
    reader = io.StringIO(text).readline
    try:
        for token in tokenize.generate_tokens(reader):
            if token.type == tokenize.COMMENT:
                # token.start is (row, column); slice the line at the comment's column.
                column = token.start[1]
                return [text[:column], text[column:]]
    except (tokenize.TokenError, SyntaxError):
        # Incomplete input (e.g. an unclosed bracket) and no comment was found before the error.
        pass
    return [text, '']

print(separate_code_from_comments("print('hello') # this is a comment"))
print(separate_code_from_comments("# this is a comment"))
print(separate_code_from_comments("loc_2 = for char in \"(*#& eht #\": # pylint: disable=one,two # something"))
print(separate_code_from_comments("loc_2 = for char in \"(*#& eht #\": "))

This should print out:

["print('hello') ", '# this is a comment']
['', '# this is a comment']
['loc_2 = for char in "(*#& eht #": ', '# pylint: disable=one,two # something']
['loc_2 = for char in "(*#& eht #": ', '']
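
As pointed out in the comments on the question, handling lines in isolation goes wrong when a line sits inside a multi-line string. A possible way around that, sketched here under the assumption that you can read the whole file into one string, is to tokenize the full source and then map each comment back to its line. The name `comments_by_line` is hypothetical:

```python
import io
import tokenize

def comments_by_line(source):
    """Map 1-based line numbers to (code, comment) for lines holding a real comment."""
    lines = source.split("\n")
    result = {}
    try:
        for tok in tokenize.generate_tokens(io.StringIO(source).readline):
            if tok.type == tokenize.COMMENT:
                row, col = tok.start
                line = lines[row - 1]
                result[row] = (line[:col], line[col:])
    except (tokenize.TokenError, SyntaxError):
        # Incomplete source (e.g. an unclosed bracket): keep what was found so far.
        pass
    return result

source = (
    'some_string = """first\n'
    'second # not a comment\n'
    'last"""\n'
    'x = 1  # a real comment\n'
)
print(comments_by_line(source))
```

Only line 4 is reported here, because the `#` on line 2 is inside the triple-quoted string.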
JonSG