Extract certain data from thousand files and put into excel sheet column by column

Question

I have around 3 thousand data files (in .witsml format shown below opened in notepad). I want to extract the time and the data set enclosed by two red lines shown. Each file has same number of rows fyi.

The catch here is each file contains data for different time stamp. So, after extracting the required data from one file I would need to place those on an excel sheet column. Then the next file data, after extraction should be placed to the next column and so on.

I have beginner level knowledge on python, Matlab. Bearing that in mind, would recommend the structure of the code to convert all these 3k files into .csv format with side by side column? enter image description here

Nic · Answer 1 · 2021-02-03T02:29:56.250

Taking a short look at it and reading over the doxumentation for this type of documents, this seems to be an XML format.

I found this for example about loading xml to pandas

From the dataframe you can simply save this to a csv via

pandas.DataFrame.to_csv()

Did you try something like this? pandas has also builtin functions for datestamps anyway.

As direct loading seems to have some difficulties you could try to read the files via python, for instance:

from datetime import datetime
def process_time(timestring):
    # extract time data here - if every file is same, always pass third line
    # make use of datetime if helpfull e.g. strptime

    return datestamp
def process_data(datastringlist):
    # IN: list with all lines with data in it
    datalist = []
    for dataline in datastringlist:
         start = dataline.find('<data>')
         end = dataline.find('</data>')
         data = dataline[start+len('<data>'):end]
         datalist.append(data)

    return datalist

if __name__ == '__main__':
    with open('myfile.xml','r') as f:
         linelist = f.readlines()

    dstamp = process_time(linelist[2]) # if time is always there
    datalist = process_data(linelist[9:]) # maybe check before where <data> starts

Nic, thanks for your comment. I was trying the second method of the code (link below). But it doesn't read the data i was intending to extract. The issue could be that the link xml file has attribute names i.e. name, age etc. whereas for my case, all my data has Data id=1, and no attribute name. So, I dont know how to get around this issue. https://medium.com/@robertopreste/from-xml-to-pandas-dataframes-9292980b1c1c — Shahriar Mahmud, Feb 03 '21 at 00:30
Hey, since you only need the data info which in encapsulated anyway, how about simply reading the file via python for the wanted info e.g. search for the first occurence of and last occurence of and also the timestamp? maybe also the urllib from python can be helpfull there (although I only used it for websited so far but it is somehow the same) and the datetime lib for your time. Create lists from that and then a dataframe and export this to csv - if needed. Because who wants to look at 3000 excels? — Nic, Feb 03 '21 at 02:04

Extract certain data from thousand files and put into excel sheet column by column

1 Answers1