Python : How to collapse the contents of several rows into one cell when importing csv in pandas

Question

I am trying to import a txt file containing radiology reports from patients. Each row is supposed to be a radiology exam (MRI/CT/etc). The original txt file looks something like this:

Name | MRN | DOB | Type_Imaging | Report_Status | Report_Text
John Doe | 1234 | 01/01/1995 | MRI |Complete | Exam Number: A5678
Report status: final
Type: MRI of brain
-----------
REPORT:   
  HISTORY: History of meningioma, surveillance

  FINDINGS: Again demonstrated is a small left frontal parasaggital meningioma, not interval growth. Evidence of cerebrovascular disease unchanged from prior.

  Again demonstrated are post-surgical changes associated with prior craniotomy.
[report_end]
James Smith | 5678 | 05/05/1987 |CT | Complete |Exam Number: A8623
Report status: final
Type: CT of chest
-----------
REPORT:
  HISTORY: Admitted patient with new fever, concern for pneumonia

  FINDINGS: A CT of the chest demostrates bla bla bla
  bla bla bla

[report_end]

When I import into pandas using pd.read_csv('filename', sep='|', header=0), the df I get has only "Exam Number: A5678" for report text in the first row. Then, the next row has "Report status: final" in the first cell and the rest of the row has NaN. The third row starts with "Type: MRI of brain" in the first cell and NaN in the rest. etc etc.

It seems like the import is taking both my defined delimiter ('|') and the tabs in the original txt as separators when reading the txt file. There are no '|' within the text of the report.

Is there a way to import this file in a way that collapses all the information between "Exam Number: A5678" and "[report end]" into one cell (the last cell in each row).

Alternatively, I was considering pre-processing this as a text file in order to extract all the Report texts in an iterative manner and append them onto a list that I will eventually be able to add to a df as a column. Looking online and on SO, I haven't been able to find a way to do this when I need to use unique start ("Exam Number:") and end ("[report end]") delimiters for the string of interest. As well as find a way to have the script continue to read the text where it left off (as opposed to just extracting the first report text).

Any thoughts?

Thanks! Maya

Because of your substitution, I can't tell what is formating and what is data. Can you give us a more realistic complete example. There is a way to do this but it depends largely on the amount of effort you put in to help us understand what the file looks like. — piRSquared, Sep 07 '17 at 03:49

score 1 · Answer 1 · answered Sep 08 '17 at 15:32

Please be careful that your [report_end] is consistent. You gave both [report_end] and [report end]. I'm assuming that is a typo.

Assuming your file name is test.txt

txt = open('test.txt').read()

names, txt_ = txt.split('\n', 1)
names = names.split('|')

pd.DataFrame(
    [t.strip().split('|') for t in txt_.split('[report_end]') if t.strip()],
    columns=names)

          Name     MRN           DOB   Type_Imaging   Report_Status                                         Report_Text
0     John Doe    1234    01/01/1995            MRI        Complete    Exam Number: A5678\nReport status: final\nTyp...
1  James Smith    5678    05/05/1987             CT        Complete   Exam Number: A8623\nReport status: final\nType...

score 0 · Answer 2 · answered Sep 13 '17 at 21:48

I ended up doing this which worked:

    import re
    import pandas as pd

    f = open("filename.txt", "r”)
    data = f.read().replace("\n", “”)
    matches = re.findall("\|Exam Number:(.*?)\[report_end\]", data, re.DOTALL)

    df= pd.read_csv("filename.txt", sep="|", parse_dates=[5]).dropna(axis=0, how="any”)

Python : How to collapse the contents of several rows into one cell when importing csv in pandas

2 Answers2