How to clean up extra heading info in middle of csv with pandas

Question

I have a csv file that I am trying to convert into a data frame. But the data has some extra heading material that gets repeated. For example:

Results Generated Date Time  
Sampling Info  
Time; Data  
1; 4.0  
2; 5.2  
3; 6.1  

Results Generated Date Time  
Sampling Info   
Time; Data  
6; 3.2   
7; 4.1   
8; 9.7

If it is a clean csv file without the extra heading material, I am using

df = pd.read_csv(r'Filelocation', sep=';', skiprows=2)

But I can't figure out how to remove the second set of header info. I don't want to lose the data below the second header set. Is there a way to remove it so the data is clean? The second header set is not always in the same location (basically a data acquisition mistake).
Thank you!

I hope my code help you, I [posted it in stack overflow](https://stackoverflow.com/a/64960024/13065379) — MrAhmedElsayed, Nov 22 '20 at 22:19

score 0 · Answer 1 · answered Jun 05 '20 at 18:41

Try to split your text file after the first block of data. Then you can make two dataframes out of it and concatenate them.

with open("yourfile.txt", 'r') as f:
    content = f.read()

# Make a list of subcontent
splitContent = content.split('Results Generated Date Time\nSampling Info\n')

Using "Results Generated Date Time\nSampling Info\n" as the argument for split, also removes those lines - This only works if the unnecessary header lines are always equal!

After this you get a list of your data as strings (variable: splitContent) separated by a delimiter (';'). Use this Answer to create dataframes from strings: https://stackoverflow.com/a/22605281/11005812.

Another approach could be to save each subcontent as a own file and read it again.

Concatening dataframes: https://pandas.pydata.org/pandas-docs/stable/user_guide/merging.html

snehil · Accepted Answer · 2020-06-05T18:58:00.807

import pandas as pd

filename = 'filename.csv'
lines =open(filename).read().split('\n')   # reading the csv file

list_ = [e for e in lines if e!='' ]  # removing '' characters from lines list

list_ = [e for e in list_ if e[0].isdigit()]  # removing string starting with non-numeric characters 

Time = [float(i.split(';')[0]) for i in list_]   # use int or float depending upon the requirements

Data = [float(i.split(';')[1].strip()) for i in list_]


df = pd.DataFrame({'Time':Time, 'Data':Data})    #making the dataframe 
df

I hope this will do the work !

How to clean up extra heading info in middle of csv with pandas

2 Answers2