Nested dictionary groups from excel

Question

I'm new in python and openpyxl. I started to learn in order to make my every day tasks easier and faster at my workplace.

Task: There is an excel file with a lots of rows, looks like this excel file

I want to create a daily report based on this excel file. In my example Today is 2019/05/08.

Expected result: Only show the info where the date is match with Today date. Expected structure:

required outcome

My solution In my solution I create a list of the rows where I can find only the Today values. After that I read only that rows and create dictionaries. But the result is nothing. I also in a trouble about how to work with multiple keys. Because there are multiple issue numbers are in the list.

from datetime import datetime
import openpyxl
from openpyxl import load_workbook
from openpyxl.utils import get_column_letter
from openpyxl.utils import column_index_from_string

#Open excel file
excel_path = "\\REE.xlsx"
wb = openpyxl.load_workbook(excel_path, data_only=True)
ws_1 = wb.worksheets[1]

#The Today date. need some format due to excel date handling
today = datetime.today()
today = today.replace(hour=00, minute=00, second=00, microsecond=00)

#Crate a list of the lines where only Today values are present
issue_line_list = []
for cell in ws_1["B"]:
    if cell.value == today:
        issue_line = cell.row
        issue_line_list.append(issue_line)

#Creare a txt file for output
file = open("daily_report.txt", "w")

#The dict what I want to use
dict = []
issue_numbers_list = []
issue = []

#Create a dict for the issues
for line in issue_line_list:
    issue_number_value = ws_1.cell(row = line, column = 3).value
    issue_numbers_list.append(issue_number_value)

#Create a dict for other information
for line in issue_line_list:
    issue_number_value = ws_1.cell(row = line, column = 3).value
    by_value = ws_1.cell(row = line, column = 2 ).value
    group_value = ws_1.cell(row = line, column = 4).value
    events_value = ws_1.cell(row = line, column = 5).value
    deadline_value = ws_1.cell(row = line, column = 6).value
    try:
        deadline_value = deadline_value.strftime('%Y.%m.%d')
    except:
        deadline_value = ""

    issue.append(issue_number_value)
    issue.append(by_value)
    issue.append(group_value)
    issue.append(events_value)
    issue.append(deadline_value)
    issue.append(deadline_value)

#Append the two dict
dict.append(issue_numbers_list)
dict.append(issue)

#Save it to the txt file.
file.write(dict)
file.close()

Questions - How to solve the multiple same key issue? - How to create nested groups? - What should add or delete to my code in order to get the expected result?

Remark Openpyxl is not only option. If you have a bettwer/easier/faster way I open for every idea.

Thank you in advance for you support!

can you show a sample of your excel file, and the required output. — Jeril, May 09 '19 at 06:32
Hello @Jeril. Pictures attached, but here are the links.excel:https://i.stack.imgur.com/d1Ciq.jpg outcome:https://i.stack.imgur.com/b7pgl.jpg — Denes, May 09 '19 at 07:31
Relevant [filtering-pandas-dataframes-on-dates](https://stackoverflow.com/questions/22898824/filtering-pandas-dataframes-on-dates) — stovfl, May 09 '19 at 09:37
you can use use a library names `pandas`, and make use of its `groupby` operation — Jeril, May 09 '19 at 09:58

Jeril · Answer 1 · 2019-05-10T07:42:27.040

0

Can you try the following:

import pandas as pd
cols = ['date', 'by', 'issue_number', 'group', 'events', 'deadline']
req_cols = ['events', 'deadline']
data = [
    ['2019-05-07', 'john', '113140', '@issue_closed', 'something different', ''],
    ['2019-05-08', 'david', '113140', '@task', 'something different', ''],
    ['2019-05-08', 'victor', '114761', '@task_result', 'something different', ''],
    ['2019-05-08', 'john', '114761', '@task', 'something different', '2019-05-10'],
    ['2019-05-08', 'david', '114761', '@task',
        'something different', '2019-05-08'],
    ['2019-05-08', 'victor', '113140', '@task_result', 'something different', ''],
    ['2019-05-07', 'john', '113140', '@issue_created',
        'something different', '2019-05-09'],
    ['2019-05-07', 'david', '113140', '@location', 'something different', ''],
    ['2019-05-07', 'victor', '113140', '@issue_closed', 'something different', 'done'],
    ['2019-05-07', 'john', '113140', '@task_result', 'something different', ''],
    ['2019-05-07', 'david', '113140', '@task',
        'something different', '2019-05-10'],
]
df = pd.DataFrame(data, columns=cols)
df1 = df.groupby(['issue_number', 'group']).describe()[req_cols].droplevel(0, axis=1)['top']
df1.columns = req_cols
print(df1)

Output:

                                          events    deadline
issue_number group                                          
113140       @issue_closed   something different        done
             @issue_created  something different  2019-05-09
             @location       something different            
             @task           something different  2019-05-10
             @task_result    something different            
114761       @task           something different  2019-05-08
             @task_result    something different

To open an excel file, you can do the following:

df = pd.read_excel(excel_path, sheet_name=my_sheet)
req_cols = ['EVENTS', 'DEADLINE']
df1 = df.groupby(['ISSUE NUMBER', 'GROUP']).describe()[req_cols].droplevel(0, axis=1)['top']
df1.columns = req_cols
print(df1)

edited May 10 '19 at 07:42

answered May 10 '19 at 05:03

Jeril

7,858
3
52
69

If I try only the first code I get the following error: "AttributeError: 'DataFrame' object has no attribute 'droplevel" If I change 'df = pd.DataFrame(data, columns=cols)' to 'df = pd.read_excel(excel_path, columns=cols)' I got a following error: "KeyError: 'ISSUE NUMBER'" – Denes May 10 '19 at 05:32
the first code worked for me. which python version are you using, and which pandas version are you using – Jeril May 10 '19 at 05:50
can you share the output of this, `pd.read_excel(filename).columns` – Jeril May 10 '19 at 05:51
The Key error issue solved. I forgot to add the sheet name. There are more sheets in my excel. but here is the output: Index(['DATE', 'BY', 'ISSUE NUMBER', 'GROUP', 'EVENTS', 'DEADLINE', 'REF'], dtype='object') No only have 'Droplevel' issue. Python: 3.7.1 Panda: 0.23.4 – Denes May 10 '19 at 06:02
is this `df = pd.read_excel(excel_path, columns=cols)` working now? – Jeril May 10 '19 at 06:14
Yes. I use this: df = pd.read_excel(excel_path, sheet_name = my_sheet, columns=cols). I also updated pandas so now your code work with data input. But if I want to use with excel I got a following error: KeyError: "['EVENTS' 'DEADLINE'] not in index" – Denes May 10 '19 at 06:33
With my excel file it is a some situation. But I tried you original script a little change in data. I change one of the task_result to task and the issue is the same. Only print the first task and skip the rest task line. Can you please check my second code. My code is working with my excel file and provide a same result. – Denes May 10 '19 at 09:05
you need to update the following `req_cols = ['date', 'by', 'events', 'deadline']` – Jeril May 10 '19 at 10:51
Updated, but the issue is the same. Only print once 'BY' column finding. – Denes May 10 '19 at 10:57
can you help me understand what exactly you want – Jeril May 10 '19 at 10:59

Denes · Answer 2 · 2019-05-10T11:27:46.957

The task almost solved, but I faced a new issue.

The code:

excel_path = "\\REE.xlsx"
my_sheet = 'Events'

cols = ['DATE', 'BY', 'ISSUE NUMBER', 'GROUP', 'EVENTS', 'DEADLINE']
req_cols = ['EVENTS', 'DEADLINE']

df = pd.read_excel(excel_path, sheet_name = my_sheet, columns=cols)

today = datetime.today().strftime('%Y-%m-%d')
today_filter =  (df[(df['DATE'] == today)])

df = pd.DataFrame(today_filter, columns=cols)
df1 = df.groupby(['ISSUE NUMBER', 'GROUP']).describe()[req_cols].droplevel(0, axis=1['top']
df1.columns = req_cols

print(df1)

On the 'BY' column there are same values. eg. '@task'. But the script print only once.

int his case

Required result:

114761
      @task Jane another words 2019-05-10
      @task result John something
      @task John something else 2019-05-08
 ...
 ...
 ...
 ...

My code result:

114761
      @task Jane another words 2019-05-10
      @task result John something
 ...
 ...
 ...

John @task something else 2019-05-08 do not print it out. Why?

And there is a some result in other options also. If there are more some values at'BY' column the script print out only the first and skip the rest.

you need to update the following: `req_cols = ['date', 'by', 'events', 'deadline']` — Jeril, May 10 '19 at 10:51
@Jeril I updated the answer with required and current result. Please check it. I hope it will help you to understand my problem. — Denes, May 10 '19 at 11:29

Nested dictionary groups from excel

2 Answers2