I'm parsing a text file into a few dictionaries so that I can write them to a CSV file. But now the text file contains comments. How do I ignore the comment lines and work with the rest of the content? I have checked a few posts that recommend pandas read_csv, but that only helps once I have a DataFrame. I need to strip the comments and read the remaining content before parsing.

EDIT: I'm concerned with SQL comments: -- and /* ... */

Part of my code: (form is a grammar defined by me)

with open("xyz.txt", 'r') as file:
    if re.search(r'select|SELECT', file.read()):
        print("hello select")
        a = form.parseString(open('xyz.txt').read())
        z = a.asDict()

Text file:

/*this is a multi line comment which 
needs to be ignored */
select book from tab where b=100 --single line comment which should be ignored
select sal from emp where job_id=101

I tried using startswith("#") for single-line comments, but the code kept running and produced no result, and I have no idea how to handle multi-line comments.

with open("xyz.txt", 'r') as file:
      for line in file:
            li=line.strip()
            if not li.startswith("#"):
                new=line.rstrip()
      while new:        
        if re.search(r'select|SELECT', file.read()):
            print("hello select")
            a = form.parseString(open('xyz.txt').read());
            z=a.asDict()
HAH
  • You can use `line.split(' #')[0]` to get rid of the single-line comments – Shijith Jul 03 '19 at 07:08
  • Post how the final CSV content should look – RomanPerekhrest Jul 03 '19 at 07:12
  • @RomanPerekhrest Final CSV content is not my problem. The problem is to ignore the comments of the text file to perform further parsing and writing it to csv file. I can perform these operations only when the comments are ignored and rest of the content is read. – HAH Jul 03 '19 at 08:06
  • Please [edit] your question to explain in more detail how the comments are defined. Do we need to cope with nested comments? What about comments inside quoted strings? Is there an escaping mechanism? Have you searched for solutions to remove C-style comments using Python? – tripleee Jul 03 '19 at 08:14

2 Answers


You can use a flag to track, on each iteration, whether the current line is inside a multi-line comment. For inline comments, use split (assuming that your queries will not contain a '#').

multiline_comment_flag = False
with open(filepath) as fp:
    for line in fp:
        if not multiline_comment_flag:
            if line.startswith('/*'):
                multiline_comment_flag = True
                if line[:-1].endswith('*/'):
                    multiline_comment_flag = False
                continue
            else:
                line =  line.split('#')[0]
                if line:
                    print(line)
                    # add your code here
                else: continue

        else:
            if line[:-1].endswith('*/'):
                multiline_comment_flag = False
            continue
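
A variant of the same flag idea, adapted to the SQL markers from the question (`--` and `/* ... */`), could look like the sketch below. The function name `strip_sql_comments_by_line` is hypothetical, and the sketch assumes that comment markers never appear inside quoted strings and that block comments are not nested.

```python
def strip_sql_comments_by_line(lines):
    """Yield each non-empty line with -- and /* ... */ comments removed.

    Assumption: comment markers never occur inside string literals,
    and block comments are not nested.
    """
    in_block = False  # True while we are inside an unterminated /* ... */
    for line in lines:
        kept = []  # pieces of this line that are outside any comment
        pos = 0
        while pos < len(line):
            if in_block:
                end = line.find("*/", pos)
                if end == -1:
                    pos = len(line)      # comment continues on the next line
                else:
                    in_block = False
                    pos = end + 2        # resume right after the "*/"
            else:
                block = line.find("/*", pos)
                dash = line.find("--", pos)
                if dash != -1 and (block == -1 or dash < block):
                    kept.append(line[pos:dash])   # drop everything after "--"
                    pos = len(line)
                elif block != -1:
                    kept.append(line[pos:block])  # keep text before "/*"
                    in_block = True
                    pos = block + 2
                else:
                    kept.append(line[pos:])       # no comment on the rest
                    pos = len(line)
        text = "".join(kept).strip()
        if text:
            yield text
```

Because it accepts any iterable of lines, it can be fed an open file directly, e.g. `for query in strip_sql_comments_by_line(fp): ...`, and it also copes with a block comment that starts or ends in the middle of a line.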
Shijith
  • It works for single-line comments but not for multi-line ones, e.g. `/* this is a multi line comment */` when the comment starts on one line and continues onto the next. It omits the first line but prints the part on the next line. – HAH Jul 03 '19 at 08:38
  • Edited; this should work for multi-line comments as well (single-line comments starting with `#` and multi-line comments starting with `/*`). – Shijith Jul 03 '19 at 08:44
  • Uhh..Now it only prints out /*...*/ content – HAH Jul 03 '19 at 08:53
  • 1
    sorry issue was line ending with '\n', corrected now – Shijith Jul 03 '19 at 08:56

Try using Regex.

Ex:

import re

with open("xyz.txt") as infile:
    data = infile.read()
    data = re.sub(r"(\/\*.*?\*\/)", "", data, flags=re.M|re.DOTALL)   #Delete Multiline Comment
    data = re.sub(r"(.*\s+\-\-.*)", "", data)  #Delete Single line Comment 
print(data.strip())
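
Note that the single-line pattern above matches the whole line that contains `--`, so the query in front of the comment is deleted too. A minimal sketch that removes only the text from `--` to the end of the line, wrapped in a hypothetical helper `strip_sql_comments`, might look like this. It assumes `--` and `/*` never occur inside quoted strings.

```python
import re

def strip_sql_comments(sql):
    """Remove /* ... */ and -- comments from a string of SQL.

    Assumption: comment markers never appear inside string literals.
    """
    # DOTALL lets "." match newlines; the lazy ".*?" stops at the first "*/".
    sql = re.sub(r"/\*.*?\*/", "", sql, flags=re.DOTALL)
    # Without DOTALL, ".*" stops at the end of the line, so only the
    # trailing comment is removed and the query before it is kept.
    sql = re.sub(r"--.*", "", sql)
    # Drop lines that became empty and trim trailing spaces.
    return "\n".join(line.rstrip() for line in sql.splitlines() if line.strip())

sample = """/*this is a multi line comment which 
needs to be ignored */
select book from tab where b=100 --single line comment which should be ignored
select sal from emp where job_id=101
"""
print(strip_sql_comments(sample))
```

For the text file in the question, this prints the two `select` statements with the trailing comment removed, ready to be passed to the parser.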
Rakesh
  • Using regex the code is short and simple! It works just the way I want. Thanks. – HAH Jul 03 '19 at 09:07
  • There are two regexes, one for multi-line and one for single-line comments. The single-line one removes the entire line where -- is used, e.g. `select abc from tab --select query`. It removes the entire line, whereas I want only the `--select query` part to be removed. – HAH Jul 03 '19 at 09:31
  • 1
    `data = re.sub(r"(\-\-.*)", "", data)` – Rakesh Jul 03 '19 at 09:35
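
If comment markers can appear inside quoted strings (a concern raised in the comments on the question), the plain patterns would cut into the string. One possible approach, sketched here under the assumption that strings are single-quoted with `''` as the escaped quote and that block comments are not nested, is to match string literals first and put them back unchanged. The helper name `strip_sql_comments_keep_strings` is hypothetical.

```python
import re

# One pass over the text: whichever alternative starts first wins, so a
# "--" or "/*" inside a quoted string is consumed as part of the string.
_TOKEN = re.compile(
    r"""
    ('(?:[^']|'')*')    # group 1: single-quoted literal ('' escapes a quote)
    | /\*.*?\*/         # block comment, possibly spanning several lines
    | --[^\n]*          # line comment up to the end of the line
    """,
    re.DOTALL | re.VERBOSE,
)

def strip_sql_comments_keep_strings(sql):
    """Remove SQL comments but leave single-quoted string literals intact.

    Assumptions: only single-quoted strings need protecting, and block
    comments are not nested.
    """
    # Keep string literals (group 1); replace comments with nothing.
    return _TOKEN.sub(lambda m: m.group(1) or "", sql)
```

For example, `strip_sql_comments_keep_strings("select '--x' from t -- note")` keeps the literal `'--x'` and removes only the trailing comment.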