I want to interpret data output from a machine, bundled in a CSV file, and produce one large table that is easier to analyze.
The input data may look like this:
marker,
info 1, info 2 \r\n
col1,col2,col3,col4,col5,colA,colB,colC \r\n
NULL,2e15,-222,info,string,0.17,b,c \r\n
... \r\n
marker,
info 3, info 4 \r\n
col1,col5,colA,colB,colD \r\n
text,foo,0.17,-1e-12,string \r\n
In words: the file contains several comma-separated data tables, each preceded by a marker string and an information line describing what's below. The tables have more columns than those information lines, their column headers are not the same for every table, and the tables have unequal numbers of rows.
I'd like the output to have the table information incorporated as separate columns, and all the various table columns concatenated.
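For the sample above, that would mean something like this, with blanks where a table does not have a given column:

```
info1,info2,col1,col2,col3,col4,col5,colA,colB,colC,colD
info 1,info 2,NULL,2e15,-222,info,string,0.17,b,c,
info 3,info 4,text,,,,foo,0.17,-1e-12,,string
```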
My current code works OK, but it feels crude, and I am wondering if I can do this all straight from pandas.read_csv():
import re
import pandas as pd

f = "test.csv"
locfile = open(f)
d = {}  # empty dictionary to load the per-table dataframes into
j = 0   # counter for how many etch tables there are in each CSV file

read_df = pd.read_csv(f, header=None, sep='\n')  # read the raw lines with pandas
test_df = read_df.loc[read_df[0].str.contains('marker')]  # use .loc to find the expression separating the tables
step_holder = list(test_df.index.values)  # store the first line of each table
step_holder.append(read_df.shape[0])  # add the bottom line of the last table
difference_step = [j - i for i, j in zip(step_holder[:-1], step_holder[1:])]  # determine the length of each table

for line in locfile:
    if len(line) > 1 and line.split(",")[0] == "marker":  # here the header information is collected
        nextline = next(locfile)
        nextline_row = re.split(r';|,|\*|\n', nextline)
        info1 = nextline_row[0]
        info2 = nextline_row[1]
        header_info = [info1, info2]
        j += 1
    if len(line) > 1 and line.split(",")[0] == "col1":  # here the actual process information is collected
        line_split = re.split(r';|,|\*', line)  # this first line will form the header information
        headers = ["info1", "info2"]
        headers.extend(line_split)
        l = []  # empty list for storing all the rows in the data table
        for k in range(difference_step[j - 1]):
            nextline = next(locfile)
            nextline_row = re.split(r';|,|\*', nextline)
            l.append(header_info + nextline_row)
        df = pd.DataFrame(l, columns=headers)  # turn the list of rows into a dataframe
        d[f + str(j)] = df  # add each dataframe to the dict, stored with a unique ID

df = pd.concat(d.values())  # concatenate all the dataframes in the dictionary
df.to_csv(outputf)
Is there a way to avoid this for loop by rearranging the data directly in pandas?
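For illustration, this is roughly the kind of pandas-centric approach I have in mind: split the raw text on the marker lines, let read_csv parse each chunk via io.StringIO, and rely on concat to align the unshared columns with NaN. The inline sample string just mirrors the data shown above.

```python
import io
import pandas as pd

# A minimal sketch, assuming each block is: a "marker," line, one info line,
# a header line, then data rows (mirrors the sample in my question).
raw = """marker,
info 1, info 2
col1,col2,col3,col4,col5,colA,colB,colC
NULL,2e15,-222,info,string,0.17,b,c
marker,
info 3, info 4
col1,col5,colA,colB,colD
text,foo,0.17,-1e-12,string
"""

frames = []
# split on the marker line; the chunk before the first marker is empty
for chunk in raw.split("marker,\n")[1:]:
    info_line, table_text = chunk.split("\n", 1)
    info1, info2 = [s.strip() for s in info_line.split(",")]
    df = pd.read_csv(io.StringIO(table_text))  # parse one table's CSV text
    df.insert(0, "info1", info1)               # prepend the info columns
    df.insert(1, "info2", info2)
    frames.append(df)

# columns missing from a given table become NaN in the combined frame
out = pd.concat(frames, ignore_index=True)
print(out)
```

This avoids the manual row counting entirely, but it reads the whole file into memory as text first, which may matter for very large files.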