Convert the values in one single column to new multiple columns pandas dataframe

Question

Say I have a DataFrame with a single column like this:

MRN: 5399394
    Adfdf Kim
    Telemedicine
      3/29/2021
    INT Pediatric Specialties - 
    G
    Encounter providers: 
    DB Polar, MD (Genetics) and Bar K Wright, 
    RD/LD (Nutrition)
    Primary diagnosis: 
    HUG
    Reason for Visit: 
    Referred by Provider Not In System

The output I am looking for is:

MRN	Telemedicine	INT Pediatric Specialties	Encounter providers:	Primary diagnosis:	Reason for Visit:
5399394	3/29/2021	G	DB Polar, MD (Genetics) and Bar K Wright, RD/LD (Nutrition)	HUG	Referred by Provider Not In System
Adfdf Kim

I need to generate a new column for each string that ends with `:`, `-`, a newline (`\n`), or whitespace (`\s`).

Telemedicine does not have a `:` or `-` character, so why is it a column name? — rochard4u, Aug 25 '23 at 11:57
Sorry, I updated my question. The separators can be a newline, a space, a colon, or a hyphen. — ARJ, Aug 25 '23 at 11:59
That poses another problem. How do you differentiate between a field name and a value? A new column starts at a newline, and values are terminated with newlines as well. — rochard4u, Aug 25 '23 at 12:04
That is the challenge here. I extracted the input column from a PDF that contains both text and tables. I used `PyPDF2` to extract the text, but it all ended up in a single column of the CSV file, so I thought I would reshape the CSV instead. Any suggestions for extracting the table directly from the PDF would be great. So far I have used the `PyPDF2` and `PyMuPDF` (`fitz`) libraries. I have no Java license for `Tabula` at the moment. — ARJ, Aug 25 '23 at 12:12
Have you tried all the methods listed [here](https://stackoverflow.com/a/75419365/19593035)? — rochard4u, Aug 25 '23 at 12:17

score -1 · Answer 1 · answered Aug 26 '23 at 17:49

This can be solved with pandas and a short line-by-line parser. Note that the sample below has a colon after "Telemedicine", unlike the input in the question, so that it is recognized as a field name:

import pandas as pd

data = "MRN: 5399394 \n Adfdf Kim \n Telemedicine: \n 3/29/2021 \n INT Pediatric Specialties - \n G \n Encounter providers: \n DB Polar, MD (Genetics) and Bar K Wright, \n RD/LD (Nutrition) \n Primary diagnosis: \n HUG \n Reason for Visit: \n Referred by Provider Not In System"

lines = [line.strip() for line in data.split("\n")]

d = {}             # field name -> list of values
column = None      # field currently being filled
continued = False  # True while the previous value line ended with a comma

for line in lines:
    if ":" in line:
        # "Field: value", or "Field:" on its own line
        name, _, value = line.partition(":")
    elif line.endswith("-"):
        # "Field -" on its own line
        name, value = line[:-1], ""
    else:
        # Plain value line
        name, value = None, line

    if name is not None:
        column = name.strip()
        d[column] = []
        continued = False

    value = value.strip()
    if not value or column is None:
        continue

    if continued:
        # The previous line ended with a comma, so this line continues the same entry
        d[column][-1] += " " + value
    else:
        d[column].append(value)
    continued = value.endswith(",")

# Determine the maximum length among all lists
max_length = max(len(v) for v in d.values())

# Pad the lists with NaN to make them the same length
for key, values in d.items():
    d[key] = values + [float('nan')] * (max_length - len(values))

# Create a DataFrame from the padded dictionary
df = pd.DataFrame(d)

print(df)

Since you didn't specify whether your DataFrame holds a single record like this one or many, I assumed a single record. Either way, you can extract all the data from the DataFrame into a list of entries and then split it on the column delimiters you specified.
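For the multi-record case, here is a minimal sketch of that idea in plain Python. It rests on assumptions that are not stated in the question: every record starts with a line beginning with `MRN:`, a field name ends with `:` or a trailing `-`, and any unlabeled line belongs to the field above it (so "Adfdf Kim" would end up in the MRN value). The names `parse_records`, `RECORD_START` and `FIELD` are made up for illustration.

```python
import re

# Assumption: every record starts with a line beginning with "MRN:".
RECORD_START = re.compile(r"^MRN\s*:")
# Assumption: a field name ends with ":" (optionally followed by a value)
# or with a trailing "-".
FIELD = re.compile(r"^(?P<name>[^:]+?)\s*(?::\s*(?P<value>.*)|-)$")


def parse_records(lines):
    """Group lines into records and map each field name to its joined value."""
    records = []
    current = None  # dict for the record being built
    field = None    # name of the field currently receiving values
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        if RECORD_START.match(line):
            # Start a new record
            current = {}
            records.append(current)
            field = None
        if current is None:
            continue  # skip anything before the first record
        m = FIELD.match(line)
        if m:
            # New field; keep the inline value if there is one
            field = m.group("name")
            current[field] = (m.group("value") or "").strip()
        elif field is not None:
            # Unlabeled line: append it to the current field's value
            current[field] = (current[field] + " " + line).strip()
    return records
```

If the lines live in a DataFrame column (say a hypothetical column named `text`), something like `pd.DataFrame(parse_records(df["text"].tolist()))` should give one row per record, with NaN where a record lacks a field. A value line that itself contains a colon would be mistaken for a field name, so real data may need a stricter pattern.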

Hope this helps!
