Operating on a file with a very specific format

Question

I have been trying to write the following function:

def track(filepath, n1, n2):

This function is meant to operate on a file with the following format:

-BORDER-
text
-BORDER-
text
-BORDER-
text
-BORDER-

How do I tell the function to operate on this file path and more precisely on the text inside each border?

Is your file a `.txt` file ? and is there always only one text line between your Border lines ? — MMF, Oct 31 '16 at 09:31
yes, but between the border lines i would always have more text lines — mewtire, Oct 31 '16 at 09:35
If i get it : you might have many text lines between your Border lines and your Border lines are all the same ? — MMF, Oct 31 '16 at 09:45
yeah, the start and the end of the text is marked by a line which contains only the word -BORDER- — mewtire, Oct 31 '16 at 09:51

Martin Evans · Answer 1 · 2016-10-31T12:04:17.670

The following approach reads your file and gives you one list of non-border lines per block:

from itertools import groupby

with open('input.txt') as f_input:
    for k, g in groupby(f_input, lambda x: not x.startswith('-BORDER-')):
        if k:
            print([line.strip() for line in g])

So if your input file was:

-BORDER-
text
-BORDER-
text
-BORDER-
this is some text
with words 
on different lines
-BORDER-

It would display the following output:

['text']
['text']
['this is some text', 'with words', 'on different lines']

This works by reading your file line by line and using Python's itertools.groupby function to group consecutive lines that give the same result for a test. In this case the test is whether the line does not start with -BORDER-. groupby yields one pair for each run of consecutive lines sharing the same test result: k is the test result, and g is an iterator over the lines in that run. So if k is True, the group contains lines that did not start with -BORDER-.

Next, since each line read from the file ends with a newline, a list comprehension is used to strip it from each of the returned lines. Note that strip() also removes any other leading or trailing whitespace, which is why 'with words ' is shown as 'with words' above.
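To make the grouping visible, here is a small sketch that runs the same groupby call on an in-memory list instead of a file and keeps every group, border lines included. The sample lines and the names is_text, sample and groups are made up for illustration.

```python
from itertools import groupby

# Hypothetical sample standing in for the lines of the file.
sample = ['-BORDER-\n', 'first\n', 'second\n', '-BORDER-\n', 'third\n', '-BORDER-\n']

def is_text(line):
    # True for ordinary lines, False for border lines.
    return not line.startswith('-BORDER-')

# Each run of consecutive lines with the same test result becomes one group.
groups = [(k, [line.strip() for line in g]) for k, g in groupby(sample, is_text)]

for k, lines in groups:
    print(k, lines)
```

The groups alternate between False (a border line) and True (the text between two borders), which is why the `if k:` check is enough to pick out the text blocks.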

If you wanted to count the words (assuming they are delimited by spaces) then you could do the following:

from itertools import groupby

with open('input.txt') as f_input:
    for k, g in groupby(f_input, lambda x: not x.startswith('-BORDER-')):
        if k:
            lines = list(g)
            word_count = sum(len(line.split()) for line in lines)
            print("{} words in {}".format(word_count, lines))

Giving you:

1 words in ['text\n']
1 words in ['text\n']
9 words in ['this is some text\n', 'with words \n', 'on different lines\n']
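The same pattern can count only the words that satisfy some condition, per block and in total. The following is just a sketch: the function name count_matching, its predicate argument and the example condition (words longer than four characters) are assumptions, not something from the question.

```python
from itertools import groupby

def count_matching(lines, predicate, border='-BORDER-'):
    """Return one count per block of non-border lines."""
    counts = []
    for is_text, group in groupby(lines, lambda x: not x.startswith(border)):
        if is_text:
            # Count the words in this block for which predicate(word) is true.
            counts.append(sum(1 for line in group
                              for word in line.split() if predicate(word)))
    return counts

# Hypothetical sample; an open file object could be passed instead of the list.
sample = ['-BORDER-\n', 'text\n', '-BORDER-\n',
          'this is some text\n', 'with words \n', 'on different lines\n',
          '-BORDER-\n']

per_block = count_matching(sample, lambda word: len(word) > 4)
print(per_block)       # [0, 3]
print(sum(per_block))  # 3
```

Summing the per-block counts gives the total over all blocks, so one pass over the file covers both cases.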

and how do i operate on the -BORDER- text -BORDER- of the file? — mewtire, Oct 31 '16 at 10:21
If you take the `if` line out, it will show you all lines grouped. Or if you take the `not` out, it would only work on the `-BORDER-` lines. Alternatively, you could add an `else` line to the `if`. — Martin Evans, Oct 31 '16 at 10:24
could i also count words that match given property in each "paragraph" or in all the paragraphs? — mewtire, Oct 31 '16 at 11:44
I have added a suitable example to the answer to help get you started. — Martin Evans, Oct 31 '16 at 13:13

MMF · Answer 2 · 2016-10-31T09:56:04.760

To retrieve the text from your text file you can do the following:

with open("/your/path/to/file", 'r') as f:
    text_list = [line for line in f.readlines() if 'BORDER' not in line]

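text_list will contain all the text lines you are looking for, each still ending with its newline. If needed, you can strip the lines using .strip(). Note that the test 'BORDER' not in line also drops any text line that happens to contain the word BORDER.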
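A possible way to do the stripping inside the list comprehension is sketched below. The function name read_text_lines and the border parameter are assumptions. Unlike the snippet above, it compares the whole stripped line against the border, so a text line that merely contains the word BORDER is kept.

```python
def read_text_lines(filepath, border='-BORDER-'):
    """Return the stripped non-border lines of the file at filepath."""
    with open(filepath) as f:
        # A line counts as a border only if it is exactly the border text
        # once surrounding whitespace (including the newline) is removed.
        return [line.strip() for line in f if line.strip() != border]
```

Calling read_text_lines("/your/path/to/file") would then return the text lines without trailing newlines.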

MMF · answered Oct 31 '16 at 09:43 · edited Oct 31 '16 at 09:56


You can also strip the line inside the list comprehension, but I don't know exactly what needs stripping. I guess each line ends with a `'\n'`. – MMF Oct 31 '16 at 10:00
If I already have the file in one of my PC's directories, is it better to "load" the file or not? – mewtire Oct 31 '16 at 10:10
What do you mean by "load" it? You can just read it using the context manager `with open(...) as f`. – MMF Oct 31 '16 at 10:11
Never mind, I got mixed up. – mewtire Oct 31 '16 at 10:14

score 0 · Accepted Answer · answered Oct 31 '16 at 10:03

Write a generator that counts the border lines seen so far and yields each text line together with that count, then use groupby to separate the blocks:

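from itertools import groupby
from operator import itemgetter

BORDER = '-BORDER-'  # border line as described in the question

def count_border(lines, border):
    # Yield (number of borders seen so far, line) for every non-border line.
    cnt = 0
    for line in lines:
        if line.strip() == border:
            cnt += 1
        else:
            yield cnt, line

with open('file') as lines:
    # Lines that share a border count belong to the same block,
    # so group on the first element of each pair.
    for _, block in groupby(count_border(lines, BORDER), itemgetter(0)):
        block = [line for _, line in block]
        print(block)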
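To connect this with the function from the question, the counting idea can be wrapped in a function that takes the file path. This is only a sketch: the name read_blocks and the border default are assumptions, and n1 and n2 are left out because the question does not say what they stand for.

```python
from itertools import groupby
from operator import itemgetter

def count_border(lines, border):
    # Yield (number of borders seen so far, line) for every non-border line.
    cnt = 0
    for line in lines:
        if line.strip() == border:
            cnt += 1
        else:
            yield cnt, line

def read_blocks(filepath, border='-BORDER-'):
    """Return a list of blocks, each a list of stripped text lines."""
    with open(filepath) as lines:
        numbered = count_border(lines, border)
        # Build the full result while the file is still open.
        return [[line.strip() for _, line in block]
                for _, block in groupby(numbered, key=itemgetter(0))]
```

A function like track could then call read_blocks(filepath) and loop over the returned blocks. Two border lines with nothing between them produce no block, because only text lines are yielded.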
