Is there a way to organize a two column csv data file so that it breaks to a new line after a certain time interval has passed (i.e. 365 days)?

Question

The csv file I have is data that was collected into two big columns: one is the daily data I need, and the other is the variable associated on the day. I would like to break off after one year of daily data and display that data in a pandas table where the beginning to end of the first year is displayed horizontally and the variable should be just under. So, to clarify, the years should have their own column, then the daily data displayed horizontally, and variable data displayed under the respective day in the right year. Right now all of my data is combined so all the time data is in one long column, and I cannot really build my code to work with it that way.

I have looked this up before but have not found anything that really describes my situation. I was thinking maybe a splice function might help, but I don't know enough to do that and I'd also like it to be dynamic. It needs to be able to modify to other csv files, so that they read in starting at the first of one year and then display as the table.

example:

my data file looks like this:

Time Temperature
1900-01-01 50
1900-01-02 52
1900-01-03 55
... ...
2019-04-05 60

So I would like to separate the years into one column, and have it sorted by the day horizontally(Jan 1...Dec 31) as a row header, and then sort the corresponding temperature data below into the correct year.

Especially output would be interesting with regards to leap years. — SpghttCd, Apr 09 '19 at 05:40

score 0 · Answer 1 · answered Apr 09 '19 at 03:55

This is definitely very possible. The steps will be to first break up all your data into an array.

const rawText = // import your csv somehow
const rawRows = rawText.split("\n"); // split at newlines

Then map your raw rows into the actual data

const rows = rawRows.map((row) => {
  const cells = row.split(",");
  return {
   date: new Date(cells[0]), 
   value: parseInt(cells[1])
  };
});

I recommend using console.logs as you figure this stuff out. Make sure each stage gives you an object you like. Then once you have all your rows, start sorting them into years.

const years = [];
// this could technically be done in the loop above
rows.forEach((row) => {
  const year = row.date.getYear() - START_YEAR;
  years[year] = row;
});

Then regen your years array into a csv. A csv is "comma separated values", with a newline at the end of each row.

let csv = "";
const mostEntriesYear = years.reduce((a, b) => Math.max(a.length, b.length));
for (let entry = 0; entry < mostEntriesYear; entry++) {
  for (let year = 0; year < years.length; year++) {
    if (year != 0) { csv += ","; }
    const row = years[year];
    csv += row.date + "," + row.value;
  }
  csv += "\n";
}

None of this code is tested, but you should be able to piece it together into the thing you're looking for.

score 0 · Answer 2 · answered Apr 09 '19 at 04:03

I have a different script, modified from here. What this does is create a different csv file for every year, assuming no header line, and no line is empty:

lines_per_file = 365    
smallfile = open("small_file_1900.csv", "w")    
# Just a seed for the beginning    
oldyear='1900'    
with open('file.csv') as bigfile:    
    for lineno, line in enumerate(bigfile):    
        year = line[:4]    
        if oldyear != year:    
            if smallfile:    
                smallfile.close()    
            small_filename = 'small_file_{}.csv'.format(year)    
            smallfile = open(small_filename, "w")    
        smallfile.wricodete(line)    
        oldyear=year    
    if smallfile:    
        smallfile.close()

SpghttCd · Answer 3 · 2019-04-09T12:16:37.807

You can use a pandas Multiindex together with unstack:

Creating your mcve:

# from io import StringIO
# import pandas as pd

# s = '''Time Temperature
#  1900-01-01 50
#  1900-01-02 52
#  1900-01-03 55
#  1901-01-01 51
#  1901-01-02 53
#  1901-01-03 56
#  1901-02-28 66
#  1902-01-01 52
#  1902-01-02 54
#  1902-01-03 57
# 2019-04-05 60'''

df = pd.read_csv(StringIO(s), delim_whitespace=True, index_col=0)

produces this dataframe:

#             Temperature
# Time                   
# 1900-01-01           50
# 1900-01-02           52
# 1900-01-03           55
# 1901-01-01           51
# 1901-01-02           53
# 1901-01-03           56
# 1901-02-28           66
# 1902-01-01           52
# 1902-01-02           54
# 1902-01-03           57
# 2019-04-05           60

Then, you can create a Multiindex with year and month/day-part of your datestrings:

idx = pd.MultiIndex.from_tuples(list(zip(df.index.str[:4], df.index.str[-5:])), names=['year', 'date'])
df.index = idx

so that your dataframe looks like

#             Temperature
# year date              
# 1900 01-01           50
#      01-02           52
#      01-03           55
# 1901 01-01           51
#      01-02           53
#      01-03           56
#      02-28           66
# 1902 01-01           52
#      01-02           54
#      01-03           57
# 2019 04-05           60

on which you can simply apply unstack():

df.unstack()

     Temperature                        
date       01-01 01-02 01-03 02-28 04-05
year                                    
1900        50.0  52.0  55.0   NaN   NaN
1901        51.0  53.0  56.0  66.0   NaN
1902        52.0  54.0  57.0   NaN   NaN
2019         NaN   NaN   NaN   NaN  60.0

Edit: I just saw you want dates in rows and years in columns, so you should set the levels-kwarg of unstack to 0:

df.unstack(0)

      Temperature                  
year         1900  1901  1902  2019
date                               
01-01        50.0  51.0  52.0   NaN
01-02        52.0  53.0  54.0   NaN
01-03        55.0  56.0  57.0   NaN
02-28         NaN  66.0   NaN   NaN
04-05         NaN   NaN   NaN  60.0

Is there a way to organize a two column csv data file so that it breaks to a new line after a certain time interval has passed (i.e. 365 days)?

3 Answers3