1

TL:DR - how do I create a dataframe/series from one or more columns in an existing non-indexed dataframe based on the column(s) containing a specific piece of text?

Relatively new to Python and data analysis and (this is my first time posting a question on Stack Overflow but I've been hunting for an answer for a long time (and used to code regularly) and not having any success.

I have a dataframe import from an Excel file that doesn't have named/indexed columns. I am trying to successfully extract data from nearly 2000 of these files which all have slightly different columns of data (of course - why make it simple... or follow a template... or simply use something other than poorly formatted Excel spreadsheets...).

The original dataframe (from a poorly structured XLS file) looks a bit like this:

0                                       NaN             RIGHT      NaN   
1                                      Date              UCVA      Sph   
2                       2007-01-13 00:00:00              6/38  [-2.00]   
3                       2009-11-05 00:00:00               6/9      NaN   
4                       2009-11-18 00:00:00              6/12      NaN   
5                       2009-12-14 00:00:00               6/9  [-1.25]   
6                       2018-04-24 00:00:00           worn CL  [-5.50]   

           3     4      5                 6     7     8        9   \
0         NaN   NaN    NaN               NaN   NaN   NaN      NaN   
1         Cyl  Axis  BSCVA  Pentacam remarks    K1    K2  K2 back   
2     [-2.75]    65    6/9               NaN   NaN   NaN      NaN   
3         NaN   NaN    NaN               NaN   NaN   NaN      NaN   
4         NaN   NaN    6/5         Pentacam     46  43.9     -6.6   
5     [-5.75]    60  6/6-1               NaN   NaN   NaN      NaN   
6     [+7.00}   170  6/7.5               NaN   NaN   NaN      NaN   

           ...              17                18    19    20       21     22  \
0          ...             NaN               NaN   NaN   NaN      NaN    NaN   
1          ...           BSCVA  Pentacam remarks    K1    K2  K2 back  K max   
2          ...             6/5               NaN   NaN   NaN      NaN    NaN   
3          ...             NaN               NaN   NaN   NaN      NaN    NaN   
4          ...             NaN          Pentacam  44.3  43.7     -6.2   45.5   
5          ...           6/4-4               NaN   NaN   NaN      NaN    NaN   
6          ...             6/5               NaN   NaN   NaN      NaN    NaN   

I want to extract a set of dataframes/series that I can then combine back together to get a 'tidy' dataframe e.g.:

1                                      Date              R-UCVA      R-Sph   
2                       2007-01-13 00:00:00              6/38  [-2.00]   
3                       2009-11-05 00:00:00               6/9      NaN   
4                       2009-11-18 00:00:00              6/12      NaN   
5                       2009-12-14 00:00:00               6/9  [-1.25]   
6                       2018-04-24 00:00:00           worn CL  [-5.50]   

1       R-Cyl R-Axis R-BSCVA  R-Penta          R-K1   R-K2  R-K2 back   
2     [-2.75]    65    6/9               NaN   NaN   NaN      NaN   
3         NaN   NaN    NaN               NaN   NaN   NaN      NaN   
4         NaN   NaN    6/5         Pentacam     46  43.9     -6.6   
5     [-5.75]    60  6/6-1               NaN   NaN   NaN      NaN   
6     [+7.00}   170  6/7.5               NaN   NaN   NaN      NaN  

etc. etc. so I'm trying to write some code that will pull a series of columns that I define by looking for the words "Date" or "UCVA" etc. etc. Then I plan to stitch them back together into a single dataframe with patient identifier as an extra column. And then cycle through all the XLS files, appending the whole lot to a single CSV file that I can then do useful stuff on (like put into an Access Database - yes, I know, but it has to be easy to use and already installed on an NHS computer - and statistical analysis).

Any suggestions? I hope that's enough information.

Thanks very much in advance.

Kind regards Vicky

Vicky
  • 11
  • 1
  • Do I understand the question correctly: you want to extract a set of columns from a Dataframe into a new Dataframe and then merge a large number of Dataframes together? Do you want to merge them on a column or do you want to stack them? – Edgar H Feb 12 '19 at 16:14
  • It looks like a good place to start is to use the `header` and `skiprows` parameters when you read in your file, assuming each is formatted similarly. That will get your columns indexed, and you can select the ones you want from there – G. Anderson Feb 12 '19 at 16:28

1 Answers1

0

Here a something that will hopefully get you started. I have prepared a text.xlsx file: Excel file and I can read it as follows

    path = 'text.xlsx'

    df = pd.read_excel(path, header=[0,1])

    # Deal with two levels of headers, here I just join them together crudely 
    df.columns = df.columns.map(lambda h: '  '.join(h))

    # Slight hack because I messed with the column names
    # I create two dataframes, one with the first column, one with the second column
    df1 = df[[df.columns[0],df.columns[1]]]
    df2 = df[[df.columns[0], df.columns[2]]]

    # Stacking them on top of each other
    result = pd.concat([df1, df2])
    print(result)

    #Merging them on the Date column
    result = pd.merge(left=df1, right=df2, on=df1.columns[0])
    print(result)

This gives the output

  RIGHT  Sph RIGHT  UCVA       Unnamed: 0_level_0  Date
0        NaN              6/38      2007-01-13 00:00:00
1        NaN              6/37      2009-11-05 00:00:00
2        NaN              9/56      2009-11-18 00:00:00
0    [-2.00]               NaN      2007-01-13 00:00:00
1        NaN               NaN      2009-11-05 00:00:00
2        NaN               NaN      2009-11-18 00:00:00

and

  Unnamed: 0_level_0  Date RIGHT  UCVA       RIGHT  Sph
0      2007-01-13 00:00:00              6/38    [-2.00]
1      2009-11-05 00:00:00              6/37        NaN
2      2009-11-18 00:00:00              9/56        NaN

Some pointers: How to merger two header rows? See this question and answer.

How to select pandas columns conditionally? See e.g. this or this

How to merge dataframes? There is a very good guide in the pandas doc

Edgar H
  • 1,376
  • 2
  • 17
  • 31