How to extract lines in text file and find duplicates

Question

I have a text file with many lines. The marker "@TestRun" appears in it many times. Each "@TestRun" is both the start of one part and the end of the previous one, so the lines between two consecutive "@TestRun" markers form one part; there can be more than 3-4 such parts in the text. My question is: how do I extract those lines part by part and find duplicate lines within each part?

My text file looks like this:

@TestRun
    And user validate message on screen "Switch to paperless" 
    And user click on "Manage accounts" label 
    And user click link with label "View all online services" 
    And user waits for 10 seconds 
    Then page is successfully launched 
    And user click link with label "Go paperless for complete convenience" 
    Then page is successfully launched 
    And user validate message on screen "#EmailAddress" 
    And user clicks on the button "Confirm" 
    Then page is successfully launched 
    And user validate message on screen "#MessageValidate" 
    Then page is successfully launched 
    And user click on "menu open user preferences" label 
    And user clicks on the link "Statement and letter preferences" 
    Then page is successfully launched 
    And user validate "Switch to paperless" button is disabled 
    And user validate message on screen "Online only" 
    When user click on "Log out" label 
    Then page is successfully launched

@TestRun 
    And user click on link "Mobile site" 
    And user set text "#Surname" on textbox name "surname" 
    Then page is successfully launched 
    And user click on link "#Account" 
    Then page is successfully launched 
    And user verify message on screen "#Account" 
    And user verify message on screen "Manage statements" 
    And user verify message on screen "Step 1 of 3" 
    Then page is successfully launched 
    And user verify message on screen "Current format type"  
    And user verify message on screen "Online" 
    When user selects the radio button "Paper" 


@TestRun
Then user wait for page load
And user click on button "Continue to Online Banking"
Then user wait for page load
    And user click on "menu open user preferences" label 
    And user clicks on the link "Statement and letter preferences" 
    Then page is successfully launched 
    And page is successfully launched 
    And user waits for 10 seconds 
@TestRun 
    Then page is successfully launched 
    And user waits for 10 seconds 
    And user click checkbox "Telephone" 
    And user click checkbox "Post" 
    And user clicks on the button "Save" 
    Then page is successfully launched

I tried out the following code but this is not working:

with open('CustPref.txt') as input_data:
    for line in input_data:
        if line.strip() == '@TestRun ':  
            break
    for line in input_data: 
        if line.strip() == '@TestRun ':
            break
        print line

I get output, but it is totally incorrect: I get only one line, which is not what I expected. How do I solve this?

line.strip() removes whitespace - why would the line afterwards match a string that has a whitespace at its end? — Patrick Artner, Dec 01 '17 at 06:06
@Jump3r That's hardly appropriate here. How exactly would regular expressions help? It's just a needless complication. — tripleee, Dec 01 '17 at 06:25
The `print line` looks like Python 2. If you are just learning Python, you should definitely consider switching to Python 3 -- the end-of-life for Python 2 was supposed to be a few months from now (though it was pushed back for pragmatic reasons), and it receives no attention from the Python community any longer. Unless you need to maintain software which cannot be updated to run on Python 3, you really want to avoid Python 2. — tripleee, Dec 01 '17 at 06:33
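To illustrate the first comment above, here is a minimal sketch (the sample header string is made up for the demonstration) of why comparing a stripped line against '@TestRun ' can never succeed:

```python
# strip() removes leading and trailing whitespace, so its result can never
# end with a space. The header string below is a hypothetical sample line.
header = "@TestRun \n"

matches_with_space = header.strip() == "@TestRun "   # never true
matches_without_space = header.strip() == "@TestRun"

print(matches_with_space, matches_without_space)
```

Comparing against '@TestRun' without the trailing space handles both header variants that appear in the file.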

Patrick Artner · Answer 1 · 2017-12-01T06:52:37.910

You are tackling two problems:

splitting the text into the correct parts
removing duplicates

Splitting:

1st Option

Parse the file line by line:

parts = []   # lines of the part currently being collected
chunks = []  # finished parts: one list of lines per @TestRun block

startNow = False # wait till first @TestRun before keeping anything

for line in Text(): # see definition for Text() below - it mimics your open('...')
    if line.strip() == '@TestRun':
        startNow = True
        if len(parts) > 0: # found a TestRun: store the lines collected so far
            chunks.append(parts)
            parts = []
    elif startNow: # only keep lines once the first TestRun was hit
        parts.append(line)

if parts: # the last part has no closing @TestRun, so store it here
    chunks.append(parts)

print(chunks) # done -> list of lists of lines, one list per part

2nd Option

Do not split the text by lines; read it in as one complete text and use list comprehensions to split it:

biggerChunks = [x.strip() for x in TextTT().split("@TestRun") ]
chunkified = [x.splitlines() for x in biggerChunks if len(x.strip()) > 0 ]

You split first on @TestRun and get a list of big text-chunks, then split each down by lines. Result is about the same: [ [all lines between 2 @TestRun's] ]

Removing duplicates (while keeping the order)

was answered here: how-do-you-remove-duplicates-from-a-list-in-whilst-preserving-order - it is a SO link so not going to regurgitate it here again :)
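For completeness, one common way to do this is sketched below. It is not necessarily the approach from the linked answer; the function name and the sample chunk are made up, and it assumes Python 3.7+ (where dicts keep insertion order) and that lines differing only in surrounding whitespace count as duplicates:

```python
def unique_in_order(lines):
    """Return the lines with later duplicates dropped, keeping first-seen order."""
    # dict keys are unique and preserve insertion order (Python 3.7+);
    # stripping is an assumption so that indentation does not hide duplicates
    return list(dict.fromkeys(line.strip() for line in lines))


# hypothetical chunk, shaped like one inner list of chunks
sample_chunk = [
    "    Then page is successfully launched ",
    "    And user waits for 10 seconds ",
    "    Then page is successfully launched",
]

deduped = unique_in_order(sample_chunk)
print(deduped)
```

Applied to every part it would be something like [unique_in_order(c) for c in chunks].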

Helpers: Text() is a replacement for your file open, TextTT() is the whole chunk of text:

def Text(): # instead of file open, returns list of lines
    return TextTT().splitlines() 

def TextTT(): # unsplit text
    return '''
@TestRun
    And user validate message on screen "Switch to paperless" 
    And user click on "Manage accounts" label 
    And user click link with label "View all online services" 
    And user waits for 10 seconds 
    Then page is successfully launched 
    And user click link with label "Go paperless for complete convenience" 
    Then page is successfully launched 
    And user validate message on screen "#EmailAddress" 
    And user clicks on the button "Confirm" 
    Then page is successfully launched 
    And user validate message on screen "#MessageValidate" 
    Then page is successfully launched 
    And user click on "menu open user preferences" label 
    And user clicks on the link "Statement and letter preferences" 
    Then page is successfully launched 
    And user validate "Switch to paperless" button is disabled 
    And user validate message on screen "Online only" 
    When user click on "Log out" label 
    Then page is successfully launched

@TestRun 
    And user click on link "Mobile site" 
    And user set text "#Surname" on textbox name "surname" 
    Then page is successfully launched 
    And user click on link "#Account" 
    Then page is successfully launched 
    And user verify message on screen "#Account" 
    And user verify message on screen "Manage statements" 
    And user verify message on screen "Step 1 of 3" 
    Then page is successfully launched 
    And user verify message on screen "Current format type"  
    And user verify message on screen "Online" 
    When user selects the radio button "Paper" 


@TestRun
Then user wait for page load
And user click on button "Continue to Online Banking"
Then user wait for page load
    And user click on "menu open user preferences" label 
    And user clicks on the link "Statement and letter preferences" 
    Then page is successfully launched 
    And page is successfully launched 
    And user waits for 10 seconds 
@TestRun 
    Then page is successfully launched 
    And user waits for 10 seconds 
    And user click checkbox "Telephone" 
    And user click checkbox "Post" 
    And user clicks on the button "Save" 
    Then page is successfully launched 
'''

See the code comments for explanation. If needed, you can use, for example, itertools.chain to recombine the inner lines.
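A small sketch of the itertools.chain idea, using a made-up stand-in for chunks (a list of lists of lines):

```python
from itertools import chain

# hypothetical stand-in for the chunks list built above
chunks = [
    ["And user click on link \"Mobile site\"", "Then page is successfully launched"],
    ["Then user wait for page load"],
]

# chain.from_iterable flattens one level: list of lists -> flat list of lines
all_lines = list(chain.from_iterable(chunks))
print(all_lines)
```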

pylang · Accepted Answer · 2017-12-01T09:20:06.143


Using the more_itertools third-party library, we can split the text before the desired target.

UPDATE: we can drop lines before the first target using itertools.dropwhile.

import itertools as it
import more_itertools as mit


with open("CustPref.txt", "r") as f:
    lines = f.readlines()

    pred = lambda x: x.startswith("@TestRun")      # trailing-space protection
    inv_pred = lambda x: not pred(x)

    lines = it.dropwhile(inv_pred, lines)          # optional
    chunks = list(mit.split_before(lines, pred))

print(chunks)

Output (abbreviated)

[['@TestRun\n',
  '    And user validate message on screen "Switch to paperless" \n',
  ...],
 ['@TestRun \n',
  '    And user click on link "Mobile site" \n',
  ...],
 ['@TestRun\n',
  'Then user wait for page load\n',
  ...],
 ...]
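If installing more_itertools is not an option, the same split-before behaviour can be approximated with the standard library. This is only a sketch: split_before_marker is a hypothetical helper, not part of any library, and the sample lines are invented:

```python
from itertools import dropwhile


def split_before_marker(lines, marker="@TestRun"):
    """Yield lists of lines, starting a new list at each line that begins with marker."""
    chunk = []
    for line in lines:
        if line.startswith(marker) and chunk:
            yield chunk          # close the previous chunk before the marker line
            chunk = []
        chunk.append(line)
    if chunk:                    # the last chunk has no following marker
        yield chunk


# hypothetical input with some text before the first marker
sample = ["preamble\n", "@TestRun\n", "  step one\n", "@TestRun \n", "  step two\n"]

# drop everything before the first marker, as in the UPDATE above
body = dropwhile(lambda l: not l.startswith("@TestRun"), sample)
chunks = list(split_before_marker(body))
print(chunks)
```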

edited Dec 01 '17 at 09:20

answered Dec 01 '17 at 06:25


What if I have text before the 1st TestRun: will it be treated as one separate list, and what if I don't need it in the list? I tried this approach and it gives me a perfect result, but if I have some text before the 1st TestRun, it is treated as one separate list. How can we avoid this? – Cyley Simon Dec 01 '17 at 08:08
Once you have a list, you can *manually* slice it however you wish, e.g. `chunks[1:]` selects all but the first entry. If you wish to *automatically* drop lines before the first target, use `itertools.dropwhile` (see update). – pylang Dec 01 '17 at 09:16
What if, in every list, I want to give an index number to every element, and duplicate elements within a list should get the same index number? I tried it using this: `def list_duplicates(seq): seen = set(); seen_add = seen.add; return [idx for idx, item in enumerate(seq) if item in seen or seen_add(item)]`, called as `list_duplicates(list_1)`, but this does not help. – Cyley Simon Dec 01 '17 at 11:48
Then you probably want to use a dictionary. I'm afraid what you are asking now is a different problem than what most readers can infer from your original post. I would accept an answer and then ask a new question, which appears to be "given a chunk of lines from a file, how do I enumerate unique lines?" Give examples of an input and expected output. – pylang Dec 01 '17 at 18:19
Thank you for answering the query. I tried indexing common steps with the same index and got the result. This is how I did it, using pd.factorize(chunks[0]): `numbers = pd.factorize(chunks[0])` – Cyley Simon Dec 04 '17 at 04:06
Ah, now it's clear. FYI, you can also factorize in pure python with a few tricks. `c = itertools.count()`, `factorize = collections.defaultdict(c.__next__)`, `[factorize[line] for line in chunks[0]]` – pylang Dec 04 '17 at 05:13
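The pure-Python trick from the last comment, spelled out as a runnable sketch. The function name and the sample chunk are made up; the idea is that a defaultdict hands out the next counter value the first time it sees a line and returns the stored value afterwards:

```python
import collections
import itertools


def factorize(lines):
    """Map each line to an integer code; equal lines get the same code."""
    counter = itertools.count()                        # 0, 1, 2, ...
    codes = collections.defaultdict(counter.__next__)  # unseen key -> next number
    return [codes[line] for line in lines]


# hypothetical chunk with repeated steps
sample_chunk = ["step A", "step B", "step A", "step C", "step B"]

codes = factorize(sample_chunk)
print(codes)  # [0, 1, 0, 2, 1]
```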

tripleee · Answer 3 · 2017-12-01T06:42:24.947

A simple approach would be to remember the lines you have already seen. You can collect them into a list, but it will be more efficient to use a dictionary or a set.

Read a line at a time. If this line (is not a new TestRun header and) it has already been seen before, don't print it. If it is a TestRun header, forget what you have seen. Print everything which gets this far in the loop. Start over with the next line.

with open('CustPref.txt') as input_data:
    seen = set()
    for line in input_data:
        # trim trailing newline
        line = line.rstrip('\n')
        if line.strip() == '@TestRun':   # header may or may not have a trailing space
            seen = set()          # who am I? what day is it?
        elif line in seen:
            # skip the rest of the for loop and start over
            continue
        else:
            seen.add(line)
        print(line)

Programmatically, it makes sense to check "if it is @TestRun, else if already seen, else add to seen" in this order so you don't have to check if it's a @TestRun twice. I wanted to keep the more-natural order in the exposition above to make it simpler.
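If the goal is to report the duplicates rather than suppress them, the same one-pass idea can be adapted. The following is only a sketch under some assumptions: duplicates_per_part is a hypothetical helper, blank lines are ignored, and lines are compared after stripping surrounding whitespace:

```python
from collections import Counter


def duplicates_per_part(lines, marker="@TestRun"):
    """Return one {line: count} dict per part, holding only lines seen more than once."""
    parts = []
    counts = None                      # stays None until the first marker is seen
    for raw in lines:
        line = raw.strip()             # assumption: ignore indentation and trailing spaces
        if line == marker:
            counts = Counter()         # start counting afresh for the new part
            parts.append(counts)
        elif counts is not None and line:
            counts[line] += 1
    return [{l: n for l, n in c.items() if n > 1} for c in parts]


# hypothetical sample shaped like the file in the question
sample = [
    "@TestRun",
    "    Then page is successfully launched ",
    "    And user waits for 10 seconds ",
    "    Then page is successfully launched",
    "@TestRun ",
    "    And user click checkbox \"Post\" ",
]

result = duplicates_per_part(sample)
print(result)
```

With a real file, passing the open file object instead of sample should work the same way, since it is also an iterable of lines.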
