Extract specific lines from file and create sections of data in python

Question

Trying to write a python script to extract lines from a file. The file is a text file which is a dump of python suds output.

I want to:

strip all characters except words and numbers. I don't want any "\n", "[", "]", "{", "=", etc characters.
find a section where it starts with "ArrayOf_xsd_string"
remove the next line "item[] =" from the result
grab the remaining 6 lines and create a dictionary based on the unique number on the fifth line (123456, 234567, 345678) using this number as the key and the remaining lines as the values (pardon my ignorance if I'm not explaining this in pythonic terminology)
output the results to a file

Data in file is a list:

[(ArrayOf_xsd_string){
   item[] = 
      "001",
      "ABCD",
      "1234",
      "wordy type stuff",
      "123456",
      "more stuff, etc",
 }, (ArrayOf_xsd_string){
   item[] = 
      "002",
      "ABCD",
      "1234",
      "wordy type stuff",
      "234567",
      "more stuff, etc",
 }, (ArrayOf_xsd_string){
   item[] = 
      "003",
      "ABCD",
      "1234",
      "wordy type stuff",
      "345678",
      "more stuff, etc",
 }]

I tried doing a re.compile and here is my poor attempt at the code:

import re, string

f = open('data.txt', 'rb')
linelist = []
for line in f:
  line = re.compile('[\W_]+')
 line.sub('', string.printable)
 linelist.append(line)
 print linelist

newlines = []
for line in linelist:
    mylines = line.split()
    if re.search(r'\w+', 'ArrayOf_xsd_string'):
      newlines.append([next(linelist) for _ in range(6)])
      print newlines

I'm a Python newbie and haven't found any results in google or on stackoverflow for how to extract specific number of lines after finding specific text. Any help is most appreciated.

Please ignore my code as I am taking "shots in the dark" :)

Here is what I'd like to see as the results:

123456: 001,ABCD,1234,wordy type stuff,more stuff etc
234567: 002,ABCD,1234,wordy type stuff,more stuff etc
345678: 003,ABCD,1234,wordy type stuff,more stuff etc

I hope that helps with trying to interpret my flawed code.

smci · Accepted Answer · 2011-09-20T00:22:43.233

Several suggestions on your code:

Stripping all non-alphanumeric characters is unnecessary and wastes time; there is no need to build linelist at all. Are you aware you can simply test each line as you read it, with plain old line.find("ArrayOf_xsd_string") or re.search(...)?

strip all characters except words and numbers. I don't want any "\n", "[", "]", "{", "=", etc characters.
find a section where it starts with "ArrayOf_xsd_string"
remove the next line "item[] =" from the result

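Then as to your regex: \w includes the underscore, so \W alone would leave it in place, and [\W_] is what actually strips it. But the assignment to line overwrites the line you just read, and the sub() call is then applied to string.printable rather than to the line: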

for line in f:
  line = re.compile('[\W_]+') # overwrites the line you just read??
  line.sub('', string.printable)
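
For comparison, here is a minimal sketch of what that loop was presumably meant to do: compile the pattern once, outside the loop, and apply it to each line rather than to string.printable. The sample lines and the names non_alnum and cleaned are made up for illustration.

```python
import re

# Compile once, outside the loop. \w matches letters, digits and "_",
# so \W alone leaves underscores in place; [\W_] removes them as well.
non_alnum = re.compile(r'[\W_]+')

# Hypothetical stand-in for the first lines of data.txt
sample_lines = ['[(ArrayOf_xsd_string){\n', '   item[] = \n', '      "001",\n']

cleaned = []
for line in sample_lines:
    # sub() returns a new string; keep the pattern and the line in separate names
    cleaned.append(non_alnum.sub('', line))

print(cleaned)
```

Note that stripping this aggressively also removes the spaces inside values such as "wordy type stuff", which is one more reason to match on the raw line instead.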

Here's my version, which reads the file directly, and also handles multiple matches:

with open('data.txt', 'r') as f:
    theDict = {}
    found = -1
    for (lineno,line) in enumerate(f):
        if found < 0:
            if line.find('ArrayOf_xsd_string')>=0:
                found = lineno
                entries = []
            continue
        # Grab following 6 lines...
        if 2 <= (lineno-found) <= 6+1:
            entry = line.strip(' "{}[]=:,\n') # '\n' included, otherwise the trailing '",' survives
            entries.append(entry)
        #then create a dict with the key from line 5
        if (lineno-found) == 6+1:
            key = entries.pop(4)
            theDict[key] = entries
            print(key + ' ' + ','.join(entries)) # comma-separated, no quotes
            #break # if you want to end on first match
            found = -1 # to process multiple matches

And the output is essentially what you wanted, minus the colon after the key (that's what ','.join(entries) was for):

123456 001,ABCD,1234,wordy type stuff,more stuff, etc
234567 002,ABCD,1234,wordy type stuff,more stuff, etc
345678 003,ABCD,1234,wordy type stuff,more stuff, etc

Using Python 2.6.1, I got the following error running the code: AttributeError: 'builtin_function_or_method' object has no attribute 'split' — fowbar, Sep 19 '11 at 16:58
(Fixed it - input.split('\n') was a hangover from testing, I had inlined your sample data while I was polishing the code). You could have fixed that for yourself, or at least tell me before you accept. I posted this a day earlier than the one you accepted. I think this one is better style and less obfuscated. In particular the chained comparison `if 2 <= (lineno-found) <= 6+1` is pretty clear and concise. Anyhoo... — smci, Sep 20 '11 at 00:27
I made a couple of small changes to make it work: entry = line.strip(' ""{}[]=:,\n') How would I add the output to a file? I tried a "for line in" statement but that only adds one line at a time. — fowbar, Sep 21 '11 at 17:06
You don't actually need to strip '[]=' since `2 <= (lineno-found)` guarantees the 'item[] =' line will not get used (as long as it's on a separate line). — smci, Sep 22 '11 at 18:38
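
On the follow-up question about writing the results to a file, one possible sketch follows. It assumes theDict has the shape built in the answer above (each key mapped to a list of the remaining five fields). The function name write_records and the file name output.txt are placeholders, not part of the original answer.

```python
def write_records(records, path):
    """Write one 'key: field1,field2,...' line per record."""
    # Open the file once and write every record inside the same block
    with open(path, 'w') as out:
        for key in sorted(records):
            out.write('%s: %s\n' % (key, ','.join(records[key])))

# Hypothetical data in the same shape as theDict above
theDict = {
    '123456': ['001', 'ABCD', '1234', 'wordy type stuff', 'more stuff, etc'],
    '234567': ['002', 'ABCD', '1234', 'wordy type stuff', 'more stuff, etc'],
}
write_records(theDict, 'output.txt')
```

Sorting the keys keeps the output order stable, since a plain dict in Python 2 does not preserve insertion order.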

sillyMunky · Answer 2 · 2011-09-16T23:45:49.637

If you want to extract a specific number of lines after a line that matches, you may as well simply read the file into a list with readlines, loop through it to find the match, then take the next N lines from the list. Alternatively, you could use a while loop along with readline, which is preferable if the files can get large.
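
A rough sketch of the second suggestion: a while loop over readline that collects the N lines following each match without loading the whole file. The function name lines_after, its parameters, and the skip argument (used here to pass over the 'item[] =' line) are assumptions for illustration.

```python
import io

def lines_after(f, marker, count, skip=1):
    """Return a list of blocks: the `count` lines that follow each line
    containing `marker`, after skipping `skip` lines."""
    blocks = []
    line = f.readline()
    while line:                      # readline() returns '' at end of file
        if marker in line:
            for _ in range(skip):    # e.g. the 'item[] =' line
                f.readline()
            block = [f.readline().strip() for _ in range(count)]
            blocks.append(block)
        line = f.readline()
    return blocks

# Small in-memory stand-in for data.txt
sample = io.StringIO(
    '[(ArrayOf_xsd_string){\n'
    '   item[] = \n'
    '      "001",\n'
    '      "ABCD",\n'
    ' }]\n'
)
print(lines_after(sample, 'ArrayOf_xsd_string', 2))
```

With the real file you would pass the open file object and count=6.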

The following is the most straightforward fix to your code I can think of, but it's not necessarily the best overall implementation. I suggest following my tips above unless you have good reasons not to, or just want to get the job done ASAP by hook or by crook ;)

newlines = []
for i in range(len(linelist)):
    # test the current line itself, not a fixed string
    if 'ArrayOf_xsd_string' in linelist[i]:
        # skip the "item[] =" line, then take the next 6 lines
        for l in linelist[i+2:i+8]:
            newlines.append(l)
        print(newlines)

Should do what you want if I have interpreted your requirements properly. It assumes linelist holds the raw lines of the file (e.g. from readlines). This says: skip the line after the match, then take the following six lines (indices i+2 up to but not including i+8) and append them to newlines one by one. Appending the whole slice at once would add it as a single nested element of newlines; newlines.extend(...) would also work.

Have fun and good luck :)

score 0 · Answer 3 · answered Sep 17 '11 at 01:28

Let's have some fun with iterators!

class SudsIterator(object):
    """extracts xsd strings from suds text file, and returns a 
    (key, (value1, value2, ...)) tuple with key being the 5th field"""
    def __init__(self, filename):
        self.data_file = open(filename)
    def __enter__(self):  # __enter__ and __exit__ are there to support 
        return self       # `with SudsIterator as blah` syntax
    def __exit__(self, exc_type, exc_val, exc_tb):
        self.data_file.close()
    def __iter__(self):
        return self
    def next(self):     # in Python 3+ this should be __next__
        """looks for the next 'ArrayOf_xsd_string' item and returns it as a
        tuple fit for stuffing into a dict"""
        data = self.data_file
        for line in data:
            if 'ArrayOf_xsd_string' not in line:
                continue
            ignore = next(data)
            val1 = next(data).strip()[1:-2] # discard beginning whitespace,
            val2 = next(data).strip()[1:-2] #   quotes, and comma
            val3 = next(data).strip()[1:-2]
            val4 = next(data).strip()[1:-2]
            key = next(data).strip()[1:-2]
            val5 = next(data).strip()[1:-2]
            break
        else:
            self.data_file.close() # make sure file gets closed
            raise StopIteration()  # and keep raising StopIteration
        return key, (val1, val2, val3, val4, val5)
    __next__ = next     # alias so the same class also iterates under Python 3

data = dict()
for key, value in SudsIterator('data.txt'):
    data[key] = value

print(data)

Thanks! This example worked exactly as it said it would. And I like how it splits each line out so if I want to add more, I can easily. Excellent for us newbies! — fowbar, Sep 19 '11 at 16:59
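
The same idea can be sketched more compactly as a generator function. This is a variation, not the answer's code: iter_records and the sample text are made up, itertools.islice pulls the six value lines, and strip(' ",\n') is used instead of slicing off the quotes.

```python
import io
from itertools import islice

def iter_records(lines):
    """Yield (key, values) for each 'ArrayOf_xsd_string' block."""
    lines = iter(lines)
    for line in lines:
        if 'ArrayOf_xsd_string' not in line:
            continue
        next(lines, None)                      # skip the 'item[] =' line
        fields = [s.strip(' ",\n') for s in islice(lines, 6)]
        if len(fields) < 6:                    # truncated block at end of input
            return
        key = fields.pop(4)                    # fifth field is the key
        yield key, tuple(fields)

# Small in-memory stand-in for data.txt
sample = io.StringIO(
    '[(ArrayOf_xsd_string){\n'
    '   item[] = \n'
    '      "001",\n'
    '      "ABCD",\n'
    '      "1234",\n'
    '      "wordy type stuff",\n'
    '      "123456",\n'
    '      "more stuff, etc",\n'
    ' }]\n'
)
data = dict(iter_records(sample))
print(data)
```

With the real file, something like `with open('data.txt') as f: data = dict(iter_records(f))` should behave the same way, and the with block takes care of closing the file.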
