0

I am trying to import a txt file containing radiology reports from patients. Each row is supposed to be a radiology exam (MRI/CT/etc). The original txt file looks something like this:

Name | MRN | DOB | Type_Imaging | Report_Status | Report_Text
John Doe | 1234 | 01/01/1995 | MRI |Complete | Exam Number: A5678
Report status: final
Type: MRI of brain
-----------
REPORT:   
  HISTORY: History of meningioma, surveillance

  FINDINGS: Again demonstrated is a small left frontal parasaggital meningioma, not interval growth. Evidence of cerebrovascular disease unchanged from prior.

  Again demonstrated are post-surgical changes associated with prior craniotomy.
[report_end]
James Smith | 5678 | 05/05/1987 |CT | Complete |Exam Number: A8623
Report status: final
Type: CT of chest
-----------
REPORT:
  HISTORY: Admitted patient with new fever, concern for pneumonia

  FINDINGS: A CT of the chest demostrates bla bla bla
  bla bla bla

[report_end]

When I import into pandas using pd.read_csv('filename', sep='|', header=0), the df I get has only "Exam Number: A5678" for report text in the first row. Then, the next row has "Report status: final" in the first cell and the rest of the row has NaN. The third row starts with "Type: MRI of brain" in the first cell and NaN in the rest. etc etc.

It seems like the import is taking both my defined delimiter ('|') and the tabs in the original txt as separators when reading the txt file. There are no '|' within the text of the report.

Is there a way to import this file in a way that collapses all the information between "Exam Number: A5678" and "[report end]" into one cell (the last cell in each row).

Alternatively, I was considering pre-processing this as a text file in order to extract all the Report texts in an iterative manner and append them onto a list that I will eventually be able to add to a df as a column. Looking online and on SO, I haven't been able to find a way to do this when I need to use unique start ("Exam Number:") and end ("[report end]") delimiters for the string of interest. As well as find a way to have the script continue to read the text where it left off (as opposed to just extracting the first report text).

Any thoughts?

Thanks! Maya

piRSquared
  • 285,575
  • 57
  • 475
  • 624
Maya Harary
  • 417
  • 1
  • 4
  • 7
  • Because of your substitution, I can't tell what is formating and what is data. Can you give us a more realistic complete example. There is a way to do this but it depends largely on the amount of effort you put in to help us understand what the file looks like. – piRSquared Sep 07 '17 at 03:49
  • @piRSquared See edits. thanks! – Maya Harary Sep 08 '17 at 14:16

2 Answers2

1

Please be careful that your [report_end] is consistent. You gave both [report_end] and [report end]. I'm assuming that is a typo.


Assuming your file name is test.txt

txt = open('test.txt').read()

names, txt_ = txt.split('\n', 1)
names = names.split('|')

pd.DataFrame(
    [t.strip().split('|') for t in txt_.split('[report_end]') if t.strip()],
    columns=names)

          Name     MRN           DOB   Type_Imaging   Report_Status                                         Report_Text
0     John Doe    1234    01/01/1995            MRI        Complete    Exam Number: A5678\nReport status: final\nTyp...
1  James Smith    5678    05/05/1987             CT        Complete   Exam Number: A8623\nReport status: final\nType...
piRSquared
  • 285,575
  • 57
  • 475
  • 624
0

I ended up doing this which worked:

    import re
    import pandas as pd

    f = open("filename.txt", "r”)
    data = f.read().replace("\n", “”)
    matches = re.findall("\|Exam Number:(.*?)\[report_end\]", data, re.DOTALL)

    df= pd.read_csv("filename.txt", sep="|", parse_dates=[5]).dropna(axis=0, how="any”) 
Maya Harary
  • 417
  • 1
  • 4
  • 7