How to remove a duplicated block of text using python

Question

I am working with text files that are radiology reports. If a document has two pages there is a block of text containing the patient name and other metadata that is repeated at the top of all the pages with the rest of the page containing the contents of the report. I have merged the pages into a single text object. Keeping the first block I want to remove all the other repeating blocks. Is there a way to remove these blocks programmatically from all such files? The repeating blocks look something like this:

 Patient ID            xxx                 Patient Name           xxx
 Gender                 Female                         Age                     43Y 8M
 Procedure Name         CT Scan - Brain (Repeat)       Performed Date          14-03-2018
 Study DateTime         14-03-2018 07:10 am            Study Description       BRAIN REPEAT
 Study Type             CT                             Referring Physician     xxx

If you know how each block starts and ends then yes because there is a pattern — SPYBUG96, Oct 25 '18 at 15:59
Thanks SPYBUG96. Yes I do. I have edited the question with the pattern of the block added as an example. I wanted to do it on a batch of files using python. — dratoms, Oct 25 '18 at 16:08
a multiple line-based solution: https://stackoverflow.com/a/68614409/191246 — ccpizza, Aug 01 '21 at 21:21

Charles Landau · Answer 1 · 2018-10-25T16:21:10.253

0

A plain-text file can be treated as a sequence of lines in Python. Consider plain.txt below:

This is the first line!\n
This is the second line!\n
This is the third line!\n

You can use the `with` statement to create a context that manages the open/close logic, like so:

with open("./plain.txt", "r") as file:
    for line in file:
        # program logic
        pass

"r" refers to the mode that open uses.

With this idiom you can store the repeating block the first time you read it and skip it whenever it is encountered again, in whatever way suits how you access your files.
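A minimal sketch of that idea, assuming the header block occupies the first few lines of the merged text and is repeated verbatim on later pages. The function name `drop_repeated_header` and the `header_len` parameter are made up for illustration:

```python
def drop_repeated_header(lines, header_len=5):
    """Keep the first `header_len` lines and drop later lines identical to any of them.

    Assumption: the header is exactly the first `header_len` lines and is
    repeated character-for-character on the following pages.
    """
    lines = list(lines)
    header = lines[:header_len]
    # Ignore blank header lines so ordinary blank lines in the body survive.
    header_set = {line for line in header if line.strip()}
    body = [line for line in lines[header_len:] if line not in header_set]
    return header + body


sample = [
    "Patient ID   xxx\n",
    "Gender       Female\n",
    "Findings page 1\n",
    "Patient ID   xxx\n",
    "Gender       Female\n",
    "Findings page 2\n",
]
cleaned = drop_repeated_header(sample, header_len=2)
print("".join(cleaned))
```

Since an open file is itself an iterable of lines, the same function could be called as `drop_repeated_header(file)` inside the `with` block shown above.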

Edit: I saw your edit and it looks like this is actually a CSV, right? If so, I recommend the pandas package.

import pandas as pd # Conventional namespace is pd

# Check out glob, os.walk, os.path for programmatic ways to generate this list
files = ["file.csv", "names.csv", "here.csv"]

# DataFrame.append was removed in pandas 2.0; read each file and concatenate instead
df = pd.concat([pd.read_csv(filepath) for filepath in files])

# To display result
print(df)

# To save to new csv
df.to_csv("big.csv")

edited Oct 25 '18 at 16:21

answered Oct 25 '18 at 16:14

Charles Landau

Hi. Thanks. No, it's not a CSV; it's text in a table format at the top of each page. The rest of the page contains the findings within the report. – dratoms Oct 25 '18 at 16:24
Ok then I think the for-loop in my original blurb is more relevant. What happens if you print each line (i.e. replace `pass` with `print(line)` in the example code)? You can pick an example file at random since you seem confident they are all organized the same way – Charles Landau Oct 25 '18 at 16:32
Thanks. I am new at this. It will take some time to try this out. I will get back to you when I do. – dratoms Oct 25 '18 at 16:38

score 0 · Answer 2 · answered Oct 25 '18 at 16:17

Assuming you can put each individual page of a document into a list:

def remove_patient_data(documents: list, pattern: str) -> str:
    document_buffer = ""
    for count, document in enumerate(documents):
        if count != 0:
            document = document.replace(pattern, "")
        document_buffer += document + '\n'
    return document_buffer

my_documents = ["blah foo blah", "blah foo bar", "blah foo baz"]
remove_patient_data(my_documents, "foo")

Which would return

'blah foo blah\nblah  bar\nblah  baz\n'

seventyseven

I want to use it on a batch of a few 100 similar files. Although the pattern remains the same the names and the dates will be different. So should I use regex in the pattern variable in your solution? And if so could you suggest a Regex sequence? – dratoms Oct 25 '18 at 16:27
Is the first word after the patient data always the same? – seventyseven Oct 25 '18 at 16:37
No that might also change. – dratoms Oct 25 '18 at 16:46
This is a difficult one, since there is no clear delimiter between the "patient metadata" and the rest of the document. If there are a small number of possible "Referring Physicians", you could cycle through a templated regex pattern with all possible physicians. – seventyseven Oct 25 '18 at 22:11
After the name of the referring physician there will be a newline character. I hope that can act as the delimiter: Referring Physician \s* Some name\n. That's where the rest starts. I have been trying regex sequences for the entire block but I just can't get it right. – dratoms Oct 26 '18 at 15:17
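Building on the delimiter described in the last comment, a regex-based sketch might look like the following. It assumes every block starts on a line beginning with "Patient ID" and ends with the line containing "Referring Physician", as in the sample from the question. The names `HEADER_RE` and `keep_first_header` are made up for illustration, and the pattern has only been reasoned through against that sample layout:

```python
import re

# Assumption: a block runs from a line starting with "Patient ID" through the
# end of the line that contains "Referring Physician".
HEADER_RE = re.compile(
    r"^[ \t]*Patient ID\b.*?Referring Physician[^\n]*\n?",
    flags=re.DOTALL | re.MULTILINE,
)


def keep_first_header(text):
    """Keep the first metadata block and remove every later one."""
    first = HEADER_RE.search(text)
    if first is None:
        return text  # no block found, leave the text unchanged
    head = text[:first.end()]
    tail = HEADER_RE.sub("", text[first.end():])
    return head + tail


page1 = (
    " Patient ID   1   Patient Name  A\n"
    " Study Type   CT   Referring Physician  Dr X\n"
    "Findings one.\n"
)
page2 = (
    " Patient ID   1   Patient Name  A\n"
    " Study Type   CT   Referring Physician  Dr X\n"
    "Findings two.\n"
)
print(keep_first_header(page1 + page2))
```

Because the pattern matches on the field labels rather than on the names and dates, it should not need changing from file to file. For a batch, one option is to loop over the files with `pathlib.Path.glob`, read each with `read_text()`, and write the cleaned text back out. If the body of a report can itself contain a line starting with "Patient ID", the pattern would need tightening.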

score 0 · Answer 3 · answered Oct 25 '18 at 16:31

You could find the starting indices of all the occurrences of the patient data by doing:

str.find(sub, start, end)

where

sub: the substring to search for within the string; in your case, the patient data
start: the position in the string at which the search begins
end: the position in the string at which the search stops

It returns the lowest index at which the searched string (the patient data) occurs, or -1 if it is not found.

You can repeat this in a loop, moving start past each match, to get all the indices where the patient data occurs.
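A small sketch of such a loop, assuming the patient data is an exact, known substring. The helper name `find_all` is made up for illustration:

```python
def find_all(text, sub):
    """Return the start index of every non-overlapping occurrence of `sub` in `text`."""
    if not sub:
        raise ValueError("sub must be a non-empty string")
    indices = []
    start = 0
    while True:
        idx = text.find(sub, start)
        if idx == -1:  # str.find returns -1 when there is no further match
            break
        indices.append(idx)
        start = idx + len(sub)  # continue searching after this match
    return indices


text = "HDR page one HDR page two HDR page three"
print(find_all(text, "HDR"))  # [0, 13, 26]
```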

Then you can replace the patient data from the second instance onwards, by doing something like:

str_new = str_old[:indices[1]] + str_old[indices[1] + len(sub):indices[2]] + str_old[indices[2] + len(sub):]
  ... assuming a total of 3 pages in your record, with indices[0] being the first occurrence, which is kept.

Another alternative:

str.replace(old, new[, count])

where

old: the substring to be replaced; in your case, the patient data
new: the substring that replaces old; this could be ' ' (whitespace)
count: if this optional argument is given, only the first count occurrences are replaced. Because replacement proceeds from the start of the string, passing one less than the number of pages would leave the patient data on the last page only. To keep the first block instead, apply replace only to the text that follows the first occurrence.
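A minimal sketch of keeping only the first block with str.replace, assuming the block text is known exactly and is identical on every page of a given document. The function name `remove_repeats` is made up for illustration:

```python
def remove_repeats(text, block):
    """Keep the first occurrence of `block` and remove all later ones."""
    first = text.find(block)
    if first == -1:
        return text  # block not present, nothing to do
    cut = first + len(block)
    # Replace only in the part of the text after the first occurrence.
    return text[:cut] + text[cut:].replace(block, "")


merged = "HDR\none\nHDR\ntwo\nHDR\nthree\n"
print(remove_repeats(merged, "HDR\n"))
```

This only helps when the exact block is known in advance. When names and dates differ from file to file, the block would first have to be extracted from each document, for example by taking everything up to the end of the "Referring Physician" line.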

Thanks. Will try it out. The names and the dates might change. Can we give regex sequence for the substring? — dratoms, Oct 25 '18 at 16:48
