-2

I need to get information from Excel file using pandas. For example we have this

    Time  Value1  Value2
0    10     30.5    21.6
1    11      11     50.2
2    13      13     13.33
3    15     101.1    2
4    23       5      5
5    4      11.1     10

I need to make DataFrame with all columns BUT filtered columns. I need to have a sum of each two rows and put it in DataFrame. Finally,

    Time  Value1  Value2
0    21     41.5    71.8
1    28     114.1   15.33
2    27     16.1     15

Trying to read all file firstly and then change it - IS NOT A SOLUTION. I need to make a dataframe with filtered columns already. Thank you :*

ego_xxx
  • 121
  • 3

1 Answers1

0

As per your comment, to read four different sheets of that size into a single df you would first have to

  • Make sure all sheets have roughly the same column count AND column names. If the names don't match then new columns will be made, making your df exponentially larger but with nans in many cells.

  • Load and process one sheet at a time so you can dispose of the original in-memory loaded data before reading the next sheet.

You shouldn't run out of memory, as reading an 8k*600 sheet should take about 40MB, but in case you come across larger datasets:

  • If you still run out of memory, try casting to an np.float32, possibly compromising some precision and a lot of performance, but getting half the memory usage.

  • Or you can process one sheet and save it to a binary file such as parquet or hdf before moving on to the next one. Ultimately if you need to do computations on the full dataset (and the set is too large for memory) you might need to look into out-of-core algorithms, but that is another story.

If it all fits in memory, then you can read with

df = [] # list of dfs
for sht in list_with_sheetnames:
    buffer = pd.read_excel(pth+filename, sheet_name=sht)
    df.append(buffer.groupby(buffer.index // 2).sum())
    del buffer
df = pd.concat(df, ignore_index=True) # convert from list of dfs to single df

You can make a grouper that increases every two rows and then sum the rows.

After you read the data from excel:

df = df.groupby(df.index // 2).sum()
RichieV
  • 5,103
  • 2
  • 11
  • 24