Splitting Pandas dataframe column full of strings by unique values Python

Question

I have a dataframe that contains a single series (column) of strings that are white space separated and concatenated together.

The dataframe column looks like:

[In]: df

[Out]:
 0   -- Generated 
 1   -- Formatting      
 2   -- Project1     
 3   -- Help        
 4   GRI                               -- Gen     
 5   S-P                               -- Gen
 6   COORD-SYS                         -- Gen
 7   COORD                             -- Gen
 8     318108.6945 6146696.895 1829.01714 318108.6945 6146696.895 
 9     1917.90444 334108.6945 6146696.895 1610.670059 334108.6945 
 10    6146696.895 1713.102355 350108.6945 6146696.895 1417.752346 350108.6945 
 11    NEF                             -- Gen
 12    318108.6945 6146696.895 1829.01714 318108.6945 6146696.895 
       2025.480832 326108.6945 6146696.895 1716.342492 326108.6945
 13    1917.90444 334108.6945 6146696.895 1610.670059 334108.6945 
       6146696.895 1813.845155 342108.6945 342108.6945
.
.

I'd like to separate this one column dataframe by the strings that aren't numbers and take all the numerical values underneath them until the next non-numeric value appears.

The output I need (for example for COORD and similar for NEF) would look like this:

[In]: df_COORD 

[Out]:


 0     COORD       1            2           3           4            5         
 1     318108.6945 6146696.895  1829.01714  318108.6945 6146696.895  nan
 2     1917.90444  334108.6945  6146696.895 1610.670059 334108.6945  nan
 3     6146696.895 1713.102355  350108.6945 6146696.895 1417.752346 350108.6945

I've first attempted at using something like:

df_COORD = pd.DataFrame(df[0][7:11]).reset_index(drop=True)
df_COORD= df_COORD.rename(columns=df_COORD.iloc[0]).drop(0)
df_COORD= df_COORD[file_coord.iloc[0][0]].str.split(expand=True)

Which works but I want to not have to explicitly state which rows I want; I'd like to separate based off the column value containing COORD or NEF or any other list of unique strings and grab data underneath until it reaches another lettered only string.

Is there any way I can do this?

Try [this](https://stackoverflow.com/questions/53218931/how-to-unnest-explode-a-column-in-a-pandas-dataframe) or [that](https://stackoverflow.com/questions/36558954/pandas-dataframe-explode-column-contents) — m13op22, May 01 '19 at 17:35
This takes a list from a column and separates it but I'd like to grab all the data underneath each non-numerical column entry. — HelloToEarth, May 01 '19 at 19:05

score 0 · Answer 1 · answered May 01 '19 at 19:08

From my understanding, what you need is to flag the block of related rows and then use groupby to create separate data frames. Based on your description, a new DF should start from a line containing non-numbers. Assume your origin DF has this column with the default name 0:

# flag the block based on the regex and cumsum()
# Warning: you might also consider scientific notation of numbers in regex
g1 = (df[0].str.contains('[^\d.\s+-]')).cumsum()

# initialize the dict for all sub-dataframes
dfs = {}

for g, d in df.groupby(g1):
    # having only 1 row in the sub-dataframue, then they are unrelated and skipped
    if d[0].size == 1: continue

    # do whatever you want to groups with rows containing numbers
    d = d[0].str.strip(' \t').str.split(expand=True)
    dname = d[0].iat[0]
    dfs[dname] = d.iloc[1:,:].rename(columns={0:dname})
    print(dfs[dname])

Splitting Pandas dataframe column full of strings by unique values Python

1 Answers1