Trying to delete specific lines from file based on keyword

Question

I have a pretty specific problem. I am trying to delete certain lines from a server configuration file based on a keyword search. In the configuration below, I want to delete the last block, the one with the keyword "nasdaq" in its directory line. This includes everything from the "database" line down to the final line, "index termName pres,eq".

What is the best way I can go about this? String.find()? What commands should I use to delete lines above and below the keyword line?

Also, I could either delete the lines or just write to a new file and ignore the last block. Some guidance needed!

include         /home/tuatara/TuataraServer-2.0/etc/openldap/schema/core.schema
include         /home/tuatara/TuataraServer-2.0/etc/openldap/schema/cosine.schema
include         /home/tuatara/TuataraServer-2.0/etc/openldap/schema/inetorgperson.schema
include         /home/tuatara/TuataraServer-2.0/etc/openldap/schema/tuatara.schema
pidfile         /home/tuatara/TuataraServer-2.0/var/slapd.pid
argsfile        /home/tuatara/TuataraServer-2.0/var/slapd.args

database        ldbm
loglevel        0
directory       /home/tuatara/TuataraServer/var/openldap-ldbm-CMDB-spellchecker-20130106-06_20_31_PM
suffix          "o=CMDB-spellchecker"
suffix          "dc=CMDB-spellchecker,dc=com"
rootdn          "cn=admin,o=CMDB-spellchecker"
rootpw          tuatara
schemacheck     on
lastmod         off
sizelimit       100000
defaultaccess   read
dbnolocking
dbnosync
cachesize       100000
dbcachesize     1000000
dbcacheNoWsync
index           objectclass pres,eq
index           default pres,eq
index           termName pres,eq

database        ldbm
loglevel        0
directory       /home/tuatara/TuataraServer/var/openldap-ldbm-CMDB-spellchecker.medicinenet-20130106-06_20_31_PM
suffix          "o=CMDB-spellchecker.medicinenet"
suffix          "dc=CMDB-spellchecker.medicinenet,dc=com"
rootdn          "cn=admin,o=CMDB-spellchecker.medicinenet"
rootpw          tuatara
schemacheck     on
lastmod         off
sizelimit       100000
defaultaccess   read
dbnolocking
dbnosync
cachesize       100000
dbcachesize     1000000
dbcacheNoWsync
index           objectclass pres,eq
index           default pres,eq
index           termName pres,eq

database        ldbm
loglevel        0
directory       /home/tuatara/TuataraServer/var/openldap-ldbm-CMDB-nasdaq-20131127-12_37_43_PM
suffix          "o=CMDB-nasdaq"
suffix          "dc=CMDB-nasdaq,dc=com"
rootdn          "cn=admin,o=CMDB-nasdaq"
rootpw          tuatara
schemacheck     on
lastmod         off
sizelimit       100000
defaultaccess   read
dbnolocking
dbnosync
cachesize       100000
dbcachesize     100000000
dbcacheNoWsync
index           objectclass pres,eq
index           default pres,eq
index           termName pres,eq

I'd use sed: `sed -i ".backup" '/nasdaq/d' test.txt`. Found [here](http://stackoverflow.com/a/7050760/645270) — keyser, Dec 06 '13 at 19:08
I was instructed to use a Python module, otherwise I would've liked to use the easier way :/ — The Nomad, Dec 06 '13 at 19:09
[Suggestion 1](http://stackoverflow.com/a/11969474/645270), [suggestion 2](http://stackoverflow.com/a/6985814/645270) — keyser, Dec 06 '13 at 19:13

score 3 · Answer 1 · answered Dec 06 '13 at 19:18

As was already mentioned, sed is built for this kind of stuff, but you could do it in python with something like this:

with open('nasdaq.txt') as fin, open('nonasdaq.txt', 'w') as fout:
    for line in fin:
        if 'nasdaq' not in line:
            fout.write(line)

All it does is loop over the lines of the input file and copy each one to the output file unless it contains the string 'nasdaq'.
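The same write-to-a-new-file idea can be applied per block rather than per line. The following is only a sketch, assuming that sections are separated by blank lines as in the question's file; the function name `drop_blocks` and the shortened sample config are hypothetical, made up for illustration.

```python
def drop_blocks(text, keyword):
    """Return text without the blank-line-separated blocks that contain keyword."""
    blocks = []
    current = []
    for line in text.splitlines():
        if line.strip():          # non-blank line: belongs to the current block
            current.append(line)
        elif current:             # blank line: closes the current block
            blocks.append(current)
            current = []
    if current:                   # last block when the file has no trailing blank line
        blocks.append(current)

    kept = [b for b in blocks if not any(keyword in l for l in b)]
    return "\n\n".join("\n".join(b) for b in kept) + "\n"


# Shortened, made-up sample in the shape of the question's config file.
sample = (
    "include core.schema\n"
    "pidfile slapd.pid\n"
    "\n"
    "database ldbm\n"
    "directory /var/openldap-ldbm-CMDB-spellchecker\n"
    "index termName pres,eq\n"
    "\n"
    "database ldbm\n"
    "directory /var/openldap-ldbm-CMDB-nasdaq\n"
    "index termName pres,eq\n"
)

result = drop_blocks(sample, "nasdaq")
print(result)
```

Because whole blocks are kept or dropped, this does not depend on how many lines sit above or below the keyword line.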

Apis Utilis

And maybe something like [this](http://stackoverflow.com/a/12791046/645270) to view previous lines and such. Or just do it manually (write three lines at a time). – keyser Dec 06 '13 at 19:21
This is close to what I need, but I also need to delete the 2 lines above and 16 lines below "directory" line containing "nasdaq". – The Nomad Dec 06 '13 at 19:22

eyquem · Accepted Answer · 2013-12-06T23:53:37.193

This should fit your need, I think:

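import re

# Raw strings pass the regex escapes (\t, \r, \n) to the re module unchanged.
pat = r'(?:^(?![\t ]*\r?\n).+\n)*?'\
      r'.*nasdaq.*\n'\
      r'(?:^(?![\t ]*\r?\n).+\n?)*'

filename = 'to_define.txt'

# Text mode ('r+'): in Python 3 a str pattern cannot be applied to bytes.
with open(filename, 'r+') as f:
    content = f.read()
    f.seek(0, 0)
    f.write(re.sub(pat, '', content, flags=re.M))
    f.truncate()  # cut off the leftover tail of the old, longer content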

It works only if the sections are separated by at least one blank line (either a bare '\n' or a line such as ' \t \n' containing only spaces and tabs; it doesn't matter which).

'(?:^(?![ \t]*\r?\n).+\n)*?'\
'.*nasdaq.*\n'\
'(?:^(?![ \t]*\r?\n).+\n?)*'

[\t ] means a character that is either a tab or a space.
[\t ]* means zero or more such characters in a row.
(?! begins a negative lookahead assertion.
(?= begins a positive lookahead assertion.
(?![\t ]*\r?\n) means that the following sequence must not occur starting at this position: zero or more spaces or tabs, an optional \r, and the newline character \n.
By 'position' I mean the location between two characters.
An assertion consumes no characters; it only tests what follows the position where it is placed.
In the RE above, the negative lookahead assertion is placed after the symbol ^, which matches the position before the first character of a line.
So the assertion, placed where it is, means: starting at the beginning of a line, the line must not consist solely of optional tabs/spaces, an optional \r, and \n. In other words, the line must not be blank.
Note that ^ means "beginning of a line" only if the flag re.MULTILINE is set; otherwise it matches only at the beginning of the string.
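As a small illustration of the points above, the sketch below applies the lookahead-guarded line pattern on its own, with and without re.MULTILINE. The sample string and variable names are made up for the demonstration.

```python
import re

# One non-blank line: ^ anchors at a line start, the negative lookahead
# rejects lines holding only spaces/tabs, and .+\n consumes the line.
line_pat = r'^(?![\t ]*\r?\n).+\n'

sample = "first\n \t \nsecond\n\nthird\n"

# With re.M, ^ matches at the start of every line, so blank lines are skipped
# and every non-blank line is found.
matches = re.findall(line_pat, sample, flags=re.M)
print(matches)        # ['first\n', 'second\n', 'third\n']

# Without re.M, ^ matches only at the very start of the string.
without_flag = re.findall(line_pat, sample)
print(without_flag)   # ['first\n']
```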

Now the partial RE (?![\t ]*\r?\n) sits inside the following RE:
(?:^.+\n)*?
Normally, (...) defines a capturing group.
Putting ?: right after the opening parenthesis means the parentheses no longer define a capturing group. But (?:...) is still useful for grouping part of a RE, for instance so that a quantifier applies to the whole group.

Here .+\n means one or more characters (none of them \n) followed by a \n.

And ^.+\n (with the flag re.M set) means: starting at the beginning of a line, one or more characters other than a newline, then a newline.
Note that since a dot . matches any character except \n, .+ cannot match past the end of the line, which is marked by \n.
So ^.+\n in fact describes one whole line!

Now what do we have?
There is a * after the non-capturing group. It means that substrings matching (?:^.+\n) are repeated 0 or more times: that is to say, we match a succession of lines.

But not just any lines, because of the negative lookahead assertion, whose meaning you now know.
So what the RE (?:^(?![\t ]*\r?\n).+\n)* matches is a succession of lines, none of which is blank. A blank line is either \n, or \t\t\n, or \t \t \n, etc. (a line containing only spaces is hard to show on Stack Overflow, but it is also a blank line).

The question mark after the * makes the quantifier lazy: the regex engine matches as few of these non-blank lines as possible and stops as soon as the RE that follows can match.
And the RE that follows is .*nasdaq.*\n, which means a line containing the word 'nasdaq'.
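To see what the lazy quantifier changes, here is a minimal sketch that adds capturing groups (not present in the answer's pattern) so the keyword line picked by each variant can be inspected. The four-line block is made up, and the lookahead is left out to keep the patterns short.

```python
import re

# Made-up block in which two lines contain the keyword.
block = (
    "database ldbm\n"
    "directory /var/nasdaq-1\n"
    "suffix o=nasdaq\n"
    "index x\n"
)

# Lazy *?: take as few leading lines as possible before the keyword line.
lazy = re.match(r'((?:^.+\n)*?)(.*nasdaq.*\n)', block, flags=re.M)

# Greedy *: take as many leading lines as possible, then backtrack.
greedy = re.match(r'((?:^.+\n)*)(.*nasdaq.*\n)', block, flags=re.M)

print(repr(lazy.group(2)))    # 'directory /var/nasdaq-1\n'
print(repr(greedy.group(2)))  # 'suffix o=nasdaq\n'
```

The lazy form stops at the first line containing the keyword, while the greedy form ends up on the last one after backtracking.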

There are some more subtleties, but I will stop here.
I think the rest will now be easier for you to understand.

EDIT

If a section were the last one in the file and its last line contained nasdaq without a trailing newline, the regex above would not match it, and the section would not be deleted.
To correct this, the part .*nasdaq.*\n must be replaced with .*nasdaq.*(\n|\Z), in which \Z means the very end of the string.

I also added a part to the regex to match the blank lines after each deleted section, so they are removed from the file as well.

pat = r'(?:^(?![\t ]*\r?\n).+\n)*?'\
      r'.*?nasdaq.*(\n|\Z)'\
      r'(?:^(?![\t ]*\r?\n).+\n?)*'\
      r'(?:[\t ]*\r?\n)*'
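A quick way to check this pattern without touching a real file is to run it on an in-memory string. The sketch below does that on a shortened, made-up config; the pattern is written with raw strings and parentheses, which is only a stylistic choice.

```python
import re

pat = (r'(?:^(?![\t ]*\r?\n).+\n)*?'    # non-blank lines before the keyword line
       r'.*?nasdaq.*(\n|\Z)'            # the line containing the keyword
       r'(?:^(?![\t ]*\r?\n).+\n?)*'    # remaining non-blank lines of the section
       r'(?:[\t ]*\r?\n)*')             # blank lines following the section

# Made-up sample: a header, the section to delete, then a section to keep.
sample = (
    "include core.schema\n"
    "\n"
    "database ldbm\n"
    "directory /var/openldap-ldbm-CMDB-nasdaq\n"
    "index termName pres,eq\n"
    "\n"
    "database ldbm\n"
    "directory /var/openldap-ldbm-CMDB-spellchecker\n"
    "index termName pres,eq\n"
)

cleaned = re.sub(pat, '', sample, flags=re.M)
print(cleaned)
```

The header and the spellchecker section survive, and the blank line that followed the deleted section is removed with it.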

I am trying to make sense of the reg expression pattern you wrote. You know any good tutorial page that could help me break it down? Thanks! — The Nomad, Dec 06 '13 at 21:15
A pleasure for me. Note the following point: ``re.sub(pat,'',content,flags=re.M)`` is written with ``flags=``. Without it, the call misbehaves. I once spent 2 hours searching for the cause. It's simply because the signature of ``re.sub()`` is ``re.sub(pattern, repl, string, count=0, flags=0)``. If you write ``re.sub(pattern, repl, string, re.M)`` instead of passing ``flags=re.M``, the object ``re.M`` is assigned to the parameter ``count``! — eyquem, Dec 06 '13 at 21:17
Heh sorry. What I meant to decipher is the `pattern = '(?:^(?![\t ]*\r?\n).+\n)*?.*nasdaq.*\n(?:^(?![\t ]*\r?\n).+\n?)*'` part. But I upvoted your answer for explaining that part too! — The Nomad, Dec 06 '13 at 21:21
The Python documentation for the ``re`` module; the ``Regular Expression HOWTO`` in the Python docs; (http://www.regular-expressions.info/tutorial.html) — eyquem, Dec 06 '13 at 21:38
I've understood about the pattern. Patterns can be rather tedious to explain, but I will explain it a little. — eyquem, Dec 06 '13 at 21:39
Wow thanks @eyquem for the lengthy explanation! I wish I could accept your answer twice lol! — The Nomad, Dec 06 '13 at 23:46
Hope it helps. I think it's hard to fully understand without studying the docs at the same time. I couldn't explain at the deepest level. - Note that I have plenty of other answers, some mediocre and some others that seem to deserve upvotes too! — eyquem, Dec 06 '13 at 23:52
I've just added another small optimization to the regex: a question mark before 'nasdaq' — eyquem, Dec 06 '13 at 23:54
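The ``flags=`` pitfall described in the comments above can be reproduced with a short sketch. The sample text is made up; the warning filter is there only because newer Python versions may warn when ``count`` is passed positionally.

```python
import re
import warnings

text = "x\nx\nx"

# Correct: flags passed by keyword, so ^ matches at the start of every line.
right = re.sub(r'^x', 'y', text, flags=re.M)

# Pitfall: the fourth positional parameter is count, so re.M (integer value 8)
# becomes count=8 and MULTILINE is never switched on.
with warnings.catch_warnings():
    warnings.simplefilter("ignore")  # silence a possible deprecation warning
    wrong = re.sub(r'^x', 'y', text, re.M)

print(repr(right))  # 'y\ny\ny'
print(repr(wrong))  # 'y\nx\nx'
```

No exception is raised in the second call; the result is silently different, which is why the mistake can take a while to track down.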

score 1 · Answer 3 · answered Dec 06 '13 at 19:30

with open('nasdaq.txt','r') as f:
    text = [l for l in f.read().splitlines()]

text = text[9:] # get rid of include headers (assumes a 9-line header)
n = 20 # your chunks are about this size (assumes fixed-size sections)

# sort chunks into list of lists
groups = []
for i in range(0, len(text), n):
    chunk = text[i:i+n]
    groups.append(chunk)

# get rid of unwanted lists by keyword; build a new list rather than
# calling pop() while iterating, which would skip the element after each removal
groups = [g for g in groups if not any('nasdaq' in x for x in g)]

thefourtheye

``splitlines()`` returns a list, so the name ``text`` isn't a good fit. By the way, ``[l for l in f.read().splitlines()]`` is the same as ``f.read().splitlines()``. – eyquem Dec 06 '13 at 20:58
``text = text[9:]`` creates a new list (that is, a new object at a new location in memory) and binds the name ``text`` to it: this isn't an in-place deletion of elements from the original list. You can see this by printing ``id(text)`` before and after the statement. - To delete in place, use ``del text[0:9]`` or ``text[0:9] = []`` – eyquem Dec 06 '13 at 21:03
What if the header isn't always 9 lines long, or the sections aren't always 20 lines long? - The natural way to process strings is with regexes, except when the problem is simple and rigid. Learning regexes takes some work, but it's highly beneficial. - I upvoted your answer because it took a lot of work too :) – eyquem Dec 06 '13 at 21:08
Thanks. 1) I started writing l.split('\t') and then erased it. 2) True, but it was a quick and dirty script. 3) I did it that way because he said it was particular to this one problem, but yours is better :) – thefourtheye Dec 06 '13 at 21:20
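The difference between rebinding a name with a slice and deleting in place, raised in the comments above, can be checked with a short sketch. The list contents and variable names are made up for illustration.

```python
# Rebinding: slicing builds a new list and points the name at it.
lines = ['a', 'b', 'c', 'd']
alias = lines                 # second name for the same list object
lines = lines[2:]
rebound_is_same = lines is alias
print(rebound_is_same)        # False
print(alias)                  # ['a', 'b', 'c', 'd'] (original list is untouched)

# In-place deletion: the same list object loses its first two items.
data = ['a', 'b', 'c', 'd']
alias2 = data
del data[0:2]
in_place_is_same = data is alias2
print(in_place_is_same)       # True
print(alias2)                 # ['c', 'd']
```

The distinction matters only when another name still refers to the original list; for a short script like the one in this answer, either form yields the same ``text``.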
