
I have a single .csv file containing multiple tables.

Using pandas, what would be the best strategy to get two DataFrames, inventory and HPBladeSystemRack, from this one file?

The input .csv looks like this:

Inventory       
System Name            IP Address    System Status
dg-enc05                             Normal
dg-enc05_vc_domain                   Unknown
dg-enc05-oa1           172.20.0.213  Normal

HP BladeSystem Rack         
System Name               Rack Name   Enclosure Name
dg-enc05                  BU40  
dg-enc05-oa1              BU40        dg-enc05
dg-enc05-oa2              BU40        dg-enc05

The best I've come up with so far is to convert this .csv file into an Excel workbook (.xlsx), split the tables into sheets, and use:

inventory = pd.read_excel('path_to_file.xlsx', 'sheet1', skiprows=1)
HPBladeSystemRack = pd.read_excel('path_to_file.xlsx', 'sheet2', skiprows=2)

However:

  • This approach requires the xlrd module.
  • Those log files have to be analyzed in real time, so it would be much better to find a way to analyze them as they come in from the logs.
  • The real logs have far more tables than those two.
  • Do the tables have a fixed number of rows? – crow_t_robot Dec 09 '15 at 17:29
  • Yes, but this number is different for each table. And I would like to avoid the approach of selecting by row numbers because the next log file might have more rows... – JahMyst Dec 09 '15 at 17:30
  • Is there a empty line(s) between two tables? Some thing solid and always :) – WoodChopper Dec 09 '15 at 17:31
  • You can use a combination of 'nrows' and 'skiprows' in pd.read_csv() to grab particular tables. You will have to know which row each table starts at and how many rows are in each table. – crow_t_robot Dec 09 '15 at 17:32
  • @WoodChopper Yes, there is an empty line between each table. – JahMyst Dec 09 '15 at 17:32
  • @crow_t_robot Ok but the goal would be to parse this log file automatically, and perhaps the next file has different number of rows, so that the code would be broken... – JahMyst Dec 09 '15 at 17:33
  • Split the file with first empty line and read two files as different df. – WoodChopper Dec 09 '15 at 17:35
  • You may want to consider these options: http://stackoverflow.com/questions/18492283/read-csv-file-to-pandas-data-frame-and-identify-data-sections-from-line-breaks – crow_t_robot Dec 09 '15 at 17:36
  • Your csv looks more like a fixed-width format (there don't seem to be any commas). Is that correct? – DSM Dec 09 '15 at 20:24
  • No, I just formatted it for stackoverflow, but it is a csv file with commas. – JahMyst Dec 09 '15 at 20:25
  • For the future, just post the raw data. We're none of us scared of raw text, and that way we can copy and paste directly to reproduce your situation. :-) – DSM Dec 09 '15 at 20:27
  • Sure, the raw data was just monstrously long, it was just easier to extract a few tables/columns/lines. – JahMyst Dec 09 '15 at 20:31

3 Answers


If you know the table names beforehand, then something like this:

import pandas as pd

df = pd.read_csv("jahmyst2.csv", header=None, names=range(3))
table_names = ["Inventory", "HP BladeSystem Rack", "Network Interface"]
groups = df[0].isin(table_names).cumsum()  # rows belonging to the same table share a group number
tables = {g.iloc[0, 0]: g.iloc[1:] for k, g in df.groupby(groups)}

should work to produce a dictionary with keys as the table names and values as the subtables.

>>> list(tables)
['HP BladeSystem Rack', 'Inventory']
>>> for k,v in tables.items():
...     print("table:", k)
...     print(v)
...     print()
...     
table: HP BladeSystem Rack
              0          1               2
6   System Name  Rack Name  Enclosure Name
7      dg-enc05       BU40             NaN
8  dg-enc05-oa1       BU40        dg-enc05
9  dg-enc05-oa2       BU40        dg-enc05

table: Inventory
                    0             1              2
1         System Name    IP Address  System Status
2            dg-enc05           NaN         Normal
3  dg-enc05_vc_domain           NaN        Unknown
4        dg-enc05-oa1  172.20.0.213         Normal

Once you've got that, you can set the column names to the first rows, etc.
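For example, to promote the first row of a subtable to its header (using a small hypothetical subtable shaped like the output above):

```python
import pandas as pd

# Hypothetical subtable as produced by the groupby above: the real header
# is still sitting in the first data row.
sub = pd.DataFrame(
    [["System Name", "IP Address", "System Status"],
     ["dg-enc05", None, "Normal"]]
)

# Promote the first row to column names and drop it from the data.
sub.columns = sub.iloc[0]
sub = sub.iloc[1:].reset_index(drop=True)
```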

DSM
  • Can I ask what the names=range(3) does ? Is it the number of columns ? If it is, that would be a different number for each table. The code throws an exception for the complete dataset that has way more tables (each with different number of columns) than that. – JahMyst Dec 09 '15 at 20:45
  • 1
    @JahMyst: when you have a variable number of columns, pandas can get confused about how many there are. You can help it by telling it how many there are; `names=range(some_number)` tells it that. You can always use `dropna` to remove excess all-NaN columns after the fact, so I tend to use `names=range(some_big_number)` to start. – DSM Dec 10 '15 at 01:05
  • @DSM, it worked like a charm. I'm having a hard time understanding the logic, though. – GC 13 May 18 '20 at 23:38
  • I like this solution. I found that if I modify the code slightly, it will automatically generate a unique ID for all my tables (because I don't know their names in advance) ```tables = {g.iloc[0,0]: g.iloc[1:] for k,g in df.groupby(groups)}``` – skleijn Nov 22 '21 at 19:55
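The `names=range(some_big_number)` plus `dropna` trick from the comments can be sketched as follows (toy inline data; 10 is an arbitrary over-allocation):

```python
import io

import pandas as pd

# Rows with a varying number of fields, as in a multi-table log.
raw = "Inventory\na,b\nx,y,z\n"

# Over-allocate column names so pandas never guesses too few columns,
# then drop the columns that ended up entirely NaN.
df = pd.read_csv(io.StringIO(raw), header=None, names=range(10))
df = df.dropna(axis=1, how="all")
```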

I assume you know the names of the tables you want to parse out of the csv file. If so, you could retrieve the index positions of each, and select the relevant slices accordingly. As a sketch, this could look like:

import pandas as pd

df = pd.read_csv('path_to_file')
index_positions = []
for table in table_names:
    index_positions.append(df[df['col_with_table_names'] == table].index.tolist()[0])

## Include end of table for last slice, omit for iteration below
index_positions.append(df.index.tolist()[-1])

tables = {}
for position in index_positions[:-1]:
    table_no = index_positions.index(position)
    tables[table_names[table_no]] = df.loc[position:index_positions[table_no + 1]]

There are certainly more elegant solutions, but this should give you a dictionary with the table names as keys and the corresponding tables as values.

Stefan
  • I assume you would load the DataFrame as df = read_csv('path_to_file.csv') ? – JahMyst Dec 09 '15 at 19:48
  • The code above has some issues: first a DataFrame doesn't have get_loc(), so this would need to be modified to df.index.get_loc(). Second I get a 'KeyError' when trying to pass "Inventory" or "HP BladeSystem Rack" to get_loc. – JahMyst Dec 09 '15 at 19:51
  • Changed so that you filter the relevant column - that has the table names - and get the corresponding `index` values, assuming it's the first that matters. Then slice using `.loc`. – Stefan Dec 09 '15 at 19:57

Pandas doesn't seem to be ready to do this easily, so I ended up writing my own split_csv function. It only requires the table names and will output .csv files named after each table.

import csv
from os.path import dirname # gets parent folder in a path
from os.path import join # concatenate paths

table_names = ["Inventory", "HP BladeSystem Rack", "Network Interface"]

def split_csv(csv_path, table_names):
    tables_infos = detect_tables_from_csv(csv_path, table_names)
    for table_info in tables_infos:
        split_csv_by_indexes(csv_path, table_info)

def split_csv_by_indexes(csv_path, table_info):
    title, start_index, end_index = table_info
    print(title, start_index, end_index)
    dir_ = dirname(csv_path)
    output_path = join(dir_, title) + ".csv"
    with open(output_path, 'w', newline='') as output_file, open(csv_path, newline='') as input_file:
        writer = csv.writer(output_file)
        reader = csv.reader(input_file)
        for i, line in enumerate(reader):
            if i < start_index:
                continue
            if i > end_index:
                break
            writer.writerow(line)

def detect_tables_from_csv(csv_path, table_names):
    output = []
    with open(csv_path, newline='') as csv_file:
        reader = csv.reader(csv_file)
        for idx, row in enumerate(reader):
            for col in row:
                match = [title for title in table_names if title in col]
                if match:
                    match = match[0] # get the first matching element
                    try:
                        end_index = idx - 1
                        start_index  # raises NameError for the first table found
                    except NameError:
                        start_index = 0
                    else:
                        output.append((previous_match, start_index, end_index))
                    print("Found new table", col)
                    start_index = idx
                    previous_match = match
                    match = False

        end_index = idx  # last 'end_index' set to EOF
        output.append((previous_match, start_index, end_index))
        return output


if __name__ == '__main__':
    csv_path = 'switch_records.csv'
    try:
        split_csv(csv_path, table_names)
    except IOError as e:
        print("This file doesn't exist. Aborting.")
        print(e)
        exit(1)
JahMyst