I am trying to import a txt file containing radiology reports from patients. Each row is supposed to be one radiology exam (MRI, CT, etc.). The original txt file looks something like this:
Name | MRN | DOB | Type_Imaging | Report_Status | Report_Text
John Doe | 1234 | 01/01/1995 | MRI |Complete | Exam Number: A5678
Report status: final
Type: MRI of brain
-----------
REPORT:
HISTORY: History of meningioma, surveillance
FINDINGS: Again demonstrated is a small left frontal parasaggital meningioma, not interval growth. Evidence of cerebrovascular disease unchanged from prior.
Again demonstrated are post-surgical changes associated with prior craniotomy.
[report_end]
James Smith | 5678 | 05/05/1987 |CT | Complete |Exam Number: A8623
Report status: final
Type: CT of chest
-----------
REPORT:
HISTORY: Admitted patient with new fever, concern for pneumonia
FINDINGS: A CT of the chest demostrates bla bla bla
bla bla bla
[report_end]
When I import it into pandas using pd.read_csv('filename', sep='|', header=0), the resulting df has only "Exam Number: A5678" as the report text in the first row. The next row has "Report status: final" in the first cell and NaN in the rest of the row. The third row has "Type: MRI of brain" in the first cell and NaN in the rest, and so on for every remaining line of the report.
It seems like the import is using both my defined delimiter ('|') and the line breaks or tabs inside each report as separators when reading the txt file. There are no '|' characters within the text of the reports.
Is there a way to import this file so that everything between "Exam Number: A5678" and "[report_end]" is collapsed into one cell (the last cell in each row)?
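For illustration, here is a minimal sketch of one way this could work without read_csv, assuming every record ends with the "[report_end]" marker shown in the sample and that the first five fields never contain '|'. The names parse_records and END_MARKER are hypothetical, and the sample string is a shortened copy of the data above.

```python
END_MARKER = "[report_end]"  # assumption: every record ends with this marker


def parse_records(text, sep="|"):
    """Split the raw text into one dict per exam, keeping the report text whole."""
    # The first line holds the column names; everything after it is data.
    header_line, _, body = text.partition("\n")
    columns = [name.strip() for name in header_line.split(sep)]
    rows = []
    for chunk in body.split(END_MARKER):
        chunk = chunk.strip()
        if not chunk:
            continue  # skip the empty remainder after the last marker
        # Split on the first len(columns) - 1 separators only, so the final
        # field (the multi-line report text) is left intact.
        fields = [field.strip() for field in chunk.split(sep, len(columns) - 1)]
        rows.append(dict(zip(columns, fields)))
    return rows


sample = """Name | MRN | DOB | Type_Imaging | Report_Status | Report_Text
John Doe | 1234 | 01/01/1995 | MRI |Complete | Exam Number: A5678
Report status: final
Type: MRI of brain
[report_end]
James Smith | 5678 | 05/05/1987 |CT | Complete |Exam Number: A8623
Report status: final
Type: CT of chest
[report_end]
"""

rows = parse_records(sample)
```

Since pd.DataFrame accepts a list of dicts, passing rows to it should give one row per exam, with the whole report in the Report_Text column. All values come back as strings, so MRN and DOB would still need converting.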
Alternatively, I was considering pre-processing the file as plain text, extracting the report texts one at a time and appending them to a list that I could later add to a df as a column. Looking online and on SO, I haven't found a way to do this with distinct start ("Exam Number:") and end ("[report_end]") delimiters for the string of interest, or a way to have the script continue reading from where it left off (as opposed to extracting only the first report text).
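A rough sketch of that pre-processing idea, assuming the whole file fits in memory and that "Exam Number:" and "[report_end]" appear exactly once per report. The names REPORT_RE and extract_report_texts are hypothetical. re.finditer resumes scanning after each match, which covers the "continue where it left off" part.

```python
import re

# Non-greedy match from the start marker up to (but not including) the end
# marker. re.DOTALL lets '.' match newlines, so multi-line reports are captured.
REPORT_RE = re.compile(r"Exam Number:.*?(?=\[report_end\])", re.DOTALL)


def extract_report_texts(text):
    """Return every report text found between the start and end markers."""
    # finditer yields non-overlapping matches in order, each search starting
    # where the previous match ended.
    return [match.group(0).strip() for match in REPORT_RE.finditer(text)]


raw = """John Doe | 1234 | 01/01/1995 | MRI |Complete | Exam Number: A5678
Report status: final
[report_end]
James Smith | 5678 | 05/05/1987 |CT | Complete |Exam Number: A8623
Report status: final
[report_end]
"""

report_texts = extract_report_texts(raw)
```

The resulting list has one entry per report, in file order, so it could be assigned as a column as long as the other columns are parsed in the same order.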
Any thoughts?
Thanks! Maya