
I want to interpret data output from a machine that is bundled in a CSV file, and produce one large table that is easier to analyze.

The input data may look like this:

marker,
info 1, info 2 \r\n
col1,col2,col3,col4,col5,colA,colB,colC \r\n
NULL,2e15,-222,info,string,0.17,b,c \r\n
... \r\n
marker,
info 3, info 4 \r\n
col1,col5,colA,colB,colD \r\n
text,foo,0.17,-1e-12,string \r\n

In words: the file contains several comma-separated data tables, each preceded by a marker string and an information line describing what's below. The tables have more columns than the information lines. Each table has its own column headers, which are not the same for every table, and the tables have unequal numbers of rows.

I'd like the output to incorporate the table information as separate columns, with all the various table columns concatenated.
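
For the sample above, the combined output might look something like this (a sketch; blank fields where a table lacks a column):

info1,info2,col1,col2,col3,col4,col5,colA,colB,colC,colD
info 1,info 2,NULL,2e15,-222,info,string,0.17,b,c,
info 3,info 4,text,,,,foo,0.17,-1e-12,,string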

My current code works OK, but it feels crude and I am wondering if I can do this all straight from pandas.read_csv():

f = "test.csv"
locfile = locfile = open(f)
d={} #create an empty dictionary to load dataframes into
j=0 #I'm using this counter to find out how many etch tables there are in each CSV file
read_df = pd.read_csv(f, header=None, sep='\n') #read out the data in pandas 
test_df = read_df.loc[read_df[0].str.contains('marker')] #use .loc to find the expression separating the tables
step_holder = list(test_df.index.values) #store the first line of each table
step_holder.append(read_df.shape[0]) #add the bottom line of the last table
difference_step = [j-i for i, j in zip(step_holder[:-1], step_holder[1:])]#determine table length for each table
for line in locfile:
    if len(line)>1 and line.split(",")[0] == "marker": #here the header information is collected
        nextline= next(locfile)
        nextline_row = re.split(';|,|\*|\n',nextline)
        info1 = nextline_row[0]
        info2 = nextline_row[1] 
        header_info = [info1,info2]
        j+=1      
    if len(line)>1 and line.split(",")[0] == "col1": #here the actual process information is collected
        line_split = re.split(';|,|\*',line) #this first line will form the header information
        headers = ["info1","info2"]
        headers.extend(line_split)
        l = [] #create an empty list for storing all the rows in the data table
        for k in range(difference_step[j-1]): 
            row = []
            nextline= next(locfile) 
            nextline_row = re.split(';|,|\*',nextline)
            row = header_info+nextline_row
            l.append(row)            
        df = pd.DataFrame(l,columns=headers) #turn the list of rows into a dataframe
        d[f+str(j)] = df #add each dataframe to the dict to store them with unique IDs     
df = pd.concat([v for k,v in d.items()]) #concatenate all the dataframes in the dictionary
df.to_csv(outputf) 

Is there a way to avoid this for loop by rearranging the data directly in pandas?

    Have you already looked at this issue? https://stackoverflow.com/questions/34184841/python-pandas-read-csv-file-containing-multiple-tables – Adrian Nov 22 '21 at 12:40

1 Answer


OK, @Adrian showed me the way, and I could replace the whole routine with just five lines of code. <3

file_loc = "test.csv"
list = ['marker']
read_df = pd.read_csv(file_loc, header=None, names=range(60)) #read in the csv file. names tells python how many columns to expect
groups = read_df[0].isin(list).cumsum() #store the recurrence of 'marker'
tables = {'process'+str(k): g.iloc[0:] for k,g in read_df.groupby(groups)} #make a dictionary of dataframes

Then I do the cleanup of the dataframes in tables using a for loop, as sketched below.
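
A minimal sketch of what that cleanup loop could look like, assuming each group carries the marker row, the info row, the header row, and then the data rows (the info1/info2 column names are placeholders, and a data column that is entirely empty would also be dropped by the dropna call):

frames = []
for name, g in tables.items():
    info1, info2 = g.iloc[1, 0], g.iloc[1, 1]  # the info line that follows the marker
    body = g.iloc[3:].copy()                   # the data rows
    body.columns = g.iloc[2]                   # the header row becomes the column names
    body = body.dropna(axis=1, how='all')      # drop the padding columns created by names=range(60)
    body.insert(0, 'info2', info2)
    body.insert(0, 'info1', info1)
    frames.append(body)
df = pd.concat(frames, ignore_index=True)
df.to_csv("output.csv", index=False)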
