Remove multiple lines in Python

Question

I have a file that looks like this:

<VirtualHost *:80>
    ServerName Url1
    DocumentRoot Url1Dir
</VirtualHost>

<VirtualHost *:80>
    ServerName Url2
    DocumentRoot Url2Dir
</VirtualHost>

<VirtualHost *:80>
    ServerName REMOVE
</VirtualHost>

<VirtualHost *:80>
    ServerName Url3
    DocumentRoot Url3Dir
</VirtualHost>

I want to remove this block (it never changes):

<VirtualHost *:80>
    ServerName REMOVE
</VirtualHost>

I tried to match the whole block with the code below, but it doesn't seem to work.

with open("out.txt", "wt") as fout:
    with open("in.txt", "rt") as fin:
        for line in fin:
            fout.write(line.replace("<VirtualHost *:80>\n    ServerName REMOVE\n</VirtualHost>\n", ""))

Shouldn't you do `fin.read()` or something? Does `for line in fin` work like that? If it did, you are reading the file line by line, so replacing 3 lines wouldn't work... — OneCricketeer, Jan 01 '16 at 20:37
@cricket_007 wow... I see now that it's checking line by line. Thanks for the help :) — Mads Andersen, Jan 01 '16 at 20:40
reading several lines at once: http://stackoverflow.com/questions/1657299/how-do-i-read-two-lines-from-a-file-at-a-time-using-python — timgeb, Jan 01 '16 at 20:41
That looks like an XML file, so maybe you could use an XML parser for that task. E.g. [lxml](http://lxml.de/) or [beautifulsoup](http://www.crummy.com/software/BeautifulSoup/bs4/doc/). — karlson, Jan 01 '16 at 20:41
As karlson wrote, you should probably update the question to express what is the goal. A better solution may be slightly different from what you are trying now. You can even use the built-in module xml.etree.ElementTree — pepr, Jan 01 '16 at 20:55
@pepr Yeah, but the code I have now also has to work with something I'm going to make in the future that isn't XML, so it shouldn't be tied to removing XML. — Mads Andersen, Jan 01 '16 at 21:03
@MadsAndersen I see, it is probably an Apache configuration file. If the situation is more complex than this one, you may be interested in using a finite automaton and processing the lines one by one. Say, you want to fix the content of one `VirtualHost` section. — pepr, Jan 01 '16 at 21:19
@MadsAndersen: See http://stackoverflow.com/a/34560380/1346705 for what I mean. — pepr, Jan 01 '16 at 21:48

score 4 · Accepted Answer · answered Jan 01 '16 at 20:43

The quickest way would be to read the whole file into a string, perform the replacement and then write the string out to the file you need. For example:

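#!/usr/bin/python

# Read the whole input file into one string.
with open('in.txt', 'r') as f:
    text = f.read()

# Remove the block together with the blank line that follows it.
text = text.replace("<VirtualHost *:80>\n    ServerName REMOVE\n</VirtualHost>\n\n", '')

with open('out.txt', 'w') as f:
    f.write(text)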
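
The same read-replace-write idea can be wrapped in a function. This is only a minimal sketch: `remove_block`, `BLOCK` and the temporary file names are hypothetical and not part of the original answer. It also checks why the per-line attempt in the question cannot work, since no single line ever contains the three-line block.

```python
import os
import tempfile

# The block to remove, exactly as it appears in the file.
BLOCK = "<VirtualHost *:80>\n    ServerName REMOVE\n</VirtualHost>\n"


def remove_block(src_path, dst_path, block=BLOCK):
    """Copy src_path to dst_path without any occurrence of block (hypothetical helper)."""
    with open(src_path) as fin:
        text = fin.read()  # the whole file as one string
    # Drop the block plus one blank separator line if present, else just the block.
    text = text.replace(block + "\n", "").replace(block, "")
    with open(dst_path, "w") as fout:
        fout.write(text)
    return text


sample = (
    "<VirtualHost *:80>\n    ServerName Url1\n    DocumentRoot Url1Dir\n</VirtualHost>\n\n"
    + BLOCK + "\n"
    + "<VirtualHost *:80>\n    ServerName Url3\n    DocumentRoot Url3Dir\n</VirtualHost>\n"
)

with tempfile.TemporaryDirectory() as tmp:
    src = os.path.join(tmp, "in.txt")
    dst = os.path.join(tmp, "out.txt")
    with open(src, "w") as f:
        f.write(sample)
    result = remove_block(src, dst)

# Iterating line by line, no single line contains the multi-line block,
# so a per-line replace() leaves every line unchanged.
line_matches = [BLOCK in line for line in sample.splitlines(keepends=True)]
```

Reading the whole file at once is fine for configuration files of this size; for very large inputs a line-oriented approach would use less memory.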

score 1 · Answer 2 · answered Jan 01 '16 at 21:46

Here is a finite-automaton solution that can easily be modified later during development. It may look complicated at first, but notice that you can read the code for each status value independently. To get an overview of what it does, you can draw a graph on paper (states as circles, transitions as arrows).

status = 0      # init -- waiting for the VirtualHost section
lst = []        # lines of the VirtualHost section
with open("in.txt") as fin, open("out.txt", "w") as fout:
    for line in fin:

        #-----------------------------------------------------------
        # Waiting for the VirtualHost section, copying.
        if status == 0: 
            if line.startswith("<VirtualHost"):
                # The section was found. Postpone the output.
                lst = [ line ]  # first line of the section
                status = 1
            else:
                # Copy the line to the output.
                fout.write(line)

        #-----------------------------------------------------------
        # Waiting for the end of the section, collecting.
        elif status == 1:   
            if line.startswith("</VirtualHost"):
                # The end of the section found, and the section
                # should not be ignored. Write it to the output.
                lst.append(line)            # collect the line
                fout.write(''.join(lst))    # write the section
                status = 0  # change the status to "outside the section"
                lst = []    # not necessary, but less error-prone for future modifications
            else:
                lst.append(line)    # collect the line
                if 'ServerName REMOVE' in line: # Should this section be ignored?
                    status = 2      # special status for ignoring this section
                    lst = []        # not necessary

        #-----------------------------------------------------------
        # Waiting for the end of the section that should be ignored.
        elif status == 2:   
            if line.startswith("</VirtualHost"):
                # The end of the section found, but the section should be ignored.
                status = 0  # outside the section
                lst = []    # not necessary

I'm actually using this script instead because it can also remove the DocumentRoot line, with some few edits. Thanks! — Mads Andersen, Jan 02 '16 at 11:50
:) When adding more `elif status == x:`, it is a good idea to add also the `else:` for the case when the new status is not implemented -- with some diagnostics. My experience is that it is better not to renumber the status values, just add a new one. The finite automaton may need some special case like unexpected end of file or so. Then I choose the _visible_ status numbers like `555`. — pepr, Jan 03 '16 at 00:20
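
The advice in the comment above (an `else:` branch with diagnostics, plus handling of an unexpected end of file) could look roughly like the sketch below. It is an illustration, not the answer's code: it swaps the integer status values for an `Enum`, works on a list of lines instead of files, and the names `State` and `filter_sections` are hypothetical.

```python
from enum import Enum


class State(Enum):
    OUTSIDE = 0     # copying lines outside any VirtualHost section
    COLLECTING = 1  # inside a section, not yet known whether to drop it
    SKIPPING = 2    # inside a section that must be dropped


def filter_sections(lines, marker="ServerName REMOVE"):
    """Return lines without the VirtualHost sections containing marker (hypothetical helper)."""
    state = State.OUTSIDE
    buffered = []  # lines of the section currently being collected
    out = []
    for line in lines:
        if state is State.OUTSIDE:
            if line.startswith("<VirtualHost"):
                buffered = [line]
                state = State.COLLECTING
            else:
                out.append(line)
        elif state is State.COLLECTING:
            buffered.append(line)
            if line.startswith("</VirtualHost"):
                out.extend(buffered)  # section is kept
                buffered = []
                state = State.OUTSIDE
            elif marker in line:
                buffered = []         # section is dropped
                state = State.SKIPPING
        elif state is State.SKIPPING:
            if line.startswith("</VirtualHost"):
                state = State.OUTSIDE
        else:
            # Diagnostics for a state that was added but never implemented.
            raise RuntimeError("unhandled state: %r" % (state,))
    if state is not State.OUTSIDE:
        # Unexpected end of input in the middle of a section.
        raise ValueError("unterminated VirtualHost section")
    return out


config = [
    "<VirtualHost *:80>\n", "    ServerName Url1\n", "</VirtualHost>\n", "\n",
    "<VirtualHost *:80>\n", "    ServerName REMOVE\n", "</VirtualHost>\n", "\n",
    "<VirtualHost *:80>\n", "    ServerName Url3\n", "</VirtualHost>\n",
]
kept = filter_sections(config)
```

Like the original automaton, this keeps the blank line that followed the removed section, so the output contains two consecutive blank lines at that spot.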

felipsmartins · Answer 3 · 2016-01-10T16:55:23.397

While the above answer is a pragmatic approach, it is fragile and not very flexible.
Here is something somewhat less fragile:

import re

def remove_entry(servername, filename):
    """Parse the file, look for the matching entry and return the new content.

    :param str servername: The server name to look for
    :param str filename: The path of the file to parse
    :return: The file content without the matching entry
    :rtype: str
    """
    with open(filename) as f:
        lines = f.readlines()

    starttag_line = None
    pattern_found = False

    for line, content in enumerate(lines):
        if '<VirtualHost ' in content:
            starttag_line = line
        # look for the entry; escape the name so it is matched literally
        if re.search(r'ServerName\s+' + re.escape(servername), content, re.I):
            pattern_found = True
        # at the next vhost end tag, remove the vhost entry
        if pattern_found and '</VirtualHost>' in content:
            del lines[starttag_line:line + 1]
            break

    # the content is returned unchanged when no entry matched
    return "".join(lines)


filename = '/tmp/file.conf'

# new file content
print(remove_entry('remove', filename))
