Say, I have a data frame with a single column like this:

MRN: 5399394
    Adfdf Kim
    Telemedicine
      3/29/2021
    INT Pediatric Specialties - 
    G
    Encounter providers: 
    DB Polar, MD (Genetics) and Bar K Wright, 
    RD/LD (Nutrition)
    Primary diagnosis: 
    HUG
    Reason for Visit: 
    Referred by Provider Not In System

The output I am looking for is:

| MRN | Telemedicine | INT Pediatric Specialties | Encounter providers: | Primary diagnosis: | Reason for Visit: |
|---|---|---|---|---|---|
| 5399394 | 3/29/2021 | G | DB Polar, MD (Genetics) and Bar K Wright, RD/LD (Nutrition) | HUG | Referred by Provider Not In System |
| Adfdf Kim | | | | | |

I need to create a new column for each string that ends with `:`, `-`, a newline (`\n`) or whitespace (`\s`).
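As a rough sketch of that rule, assuming a field name is any line whose last non-space character is `:` or `-` (the pattern and the helper name `is_field_name` are only illustrative, not part of any library):

```python
import re

# Assumption: a field name is a line that ends with ":" or "-",
# ignoring trailing whitespace.
FIELD_NAME = re.compile(r"[:\-]\s*$")


def is_field_name(line: str) -> bool:
    """Return True if the line looks like a field name under that assumption."""
    return bool(FIELD_NAME.search(line))


lines = ["Encounter providers: ", "HUG", "INT Pediatric Specialties - ", "3/29/2021"]
print([is_field_name(line) for line in lines])  # [True, False, True, False]
```

Lines such as `MRN: 5399394`, where the value follows the name on the same line, and `Telemedicine`, which has no delimiter at all, are not matched by this pattern and would need separate handling.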

ARJ
  • Telemedicine does not have a `:` or `-` char, so why is it a column name? – rochard4u Aug 25 '23 at 11:57
  • Sorry, I updated my question. The separators can be a new line, a space, a colon or a hyphen. – ARJ Aug 25 '23 at 11:59
  • It poses another problem. How do you differentiate between a field name and a value? A new column is generated at new lines, and values are terminated with new lines as well. – rochard4u Aug 25 '23 at 12:04
  • That is the challenge here. I extracted the input column from a PDF that contains both text and tables. I used `PyPDF2` to extract the text, but it all came out as a single column in the CSV file, so I thought I would reshape the CSV instead. If you have any suggestions for extracting the table directly from the PDF, that would be great. So far I have used the `PyPDF2` and `PyMuPDF` (`fitz`) libraries. I have no Java license for `Tabula` at the moment. – ARJ Aug 25 '23 at 12:12
  • Have you tried all the methods listed [here](https://stackoverflow.com/a/75419365/19593035) ? – rochard4u Aug 25 '23 at 12:17
  • Thanks, no I haven't. I can try those solutions. – ARJ Aug 25 '23 at 12:22

1 Answer

One way to approach this with pandas is to walk through the lines and treat every line that contains `:` or `-` as the start of a new column:

import pandas as pd

data = "MRN: 5399394 \n Adfdf Kim \n Telemedicine: \n 3/29/2021 \n INT Pediatric Specialties - \n G \n Encounter providers: \n DB Polar, MD (Genetics) and Bar K Wright, \n RD/LD (Nutrition) \n Primary diagnosis: \n HUG \n Reason for Visit: \n Referred by Provider Not In System"

x = data.split("\n")
x = [i.strip() for i in x]

d = {}         # column name -> list of values
column = ""    # column currently being filled
pending = []   # lines of a value that spans several rows

print(x)  # inspect the cleaned lines


def flush():
    # Store the collected multi-line value as a single entry.
    if pending:
        d.setdefault(column, []).append(" ".join(pending))
        pending.clear()


for line in x:
    # Positions of the first ":" and "-" on the line, if present.
    hits = [p for p in (line.find(":"), line.find("-")) if p != -1]
    if hits and "," not in line[:min(hits)]:
        # Column line: the text before the delimiter is the column name,
        # any text after it is the first value.
        flush()
        cut = min(hits)
        column = line[:cut].strip()
        d[column] = []
        if line[cut + 1:].strip():
            d[column].append(line[cut + 1:].strip())
    elif "," in line or pending:
        # A line containing "," starts a multi-line value, which runs
        # until the next column line.
        pending.append(line)
    else:
        d.setdefault(column, []).append(line)

flush()  # the last value may itself span several rows

# Determine the maximum length among all lists
max_length = max(len(v) for v in d.values())

# Pad the lists with NaN to make them the same length
for key, values in d.items():
    d[key] = values + [float('nan')] * (max_length - len(values))

# Create a DataFrame from the padded dictionary
df = pd.DataFrame(d)

print(df)

Since you didn't specify whether your dataframe holds a single record like this one or several, I assumed a single record. Either way, you can extract the whole column from the dataframe as a list of entries and then split it on the column delimiters you specified.
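If the column does hold several records, one possible way to get that list of entries is sketched below. It assumes every record starts with a line beginning with `MRN`; the helper name `split_records` and the second record in the sample are made up for illustration.

```python
def split_records(lines, start="MRN"):
    """Group lines into records; a new record starts at each line
    that begins with `start` (an assumption about the data)."""
    records = []
    for raw in lines:
        line = str(raw).strip()
        if not line:
            continue  # skip blank rows
        if line.startswith(start) or not records:
            records.append([])
        records[-1].append(line)
    return records


# Hypothetical column contents; with a real single-column dataframe `df`,
# the list could come from df.iloc[:, 0].astype(str).tolist().
column = ["MRN: 5399394", " Adfdf Kim", "Telemedicine:", "3/29/2021",
          "MRN: 0000001", "Telemedicine:", "4/1/2021"]

for record in split_records(column):
    print(record)
```

Each record can then be joined with `"\n".join(record)` and run through the parsing loop above, giving one dictionary (and so one set of rows) per record.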

Hope this helps!

Natália