0

I have data query in my excel workbook to load data from a CSV-file. I want to use the Power Query-Editor to transform the data before working with them actually.

The simplified structure of the CSV-file are as shown in the html-table snippet below:

<table style="text-align:center">
  <tr>
    <th style="width:100px">Response-ID</th>
    <th style="width:80px">Task-ID</th>
    <th style="width:200px">Executed Action</th>
    <th style="width:200px">Status</th>
    <th style="width:80px">Date</th>
  </tr>

  <tr>
    <td>1</td>
    <td>A</td>
    <td>Some Task to be done</td>
    <td>Open</td>
    <td>20.01.2020</td>
  </tr>

  <tr>
    <td>2</td>
    <td>A</td>
    <td>Another Action done today</td>
    <td>Open</td>
    <td>25.01.2020</td>
  </tr>

  <tr>
    <td>3</td>
    <td>B</td>
    <td>New Task</td>
    <td>Open</td>
    <td>25.01.2020</td>
  </tr>

  <tr>
    <td>4</td>
    <td>A</td>
    <td>More actions in this matter</td>
    <td><b>Done</b></td>
    <td>26.01.2020</td>
  </tr>

  <tr>
    <td>5</td>
    <td>C</td>
    <td>Another Task</td>
    <td>Open</td>
    <td>27.02.2020</td>
  </tr>

  <tr>
    <td>6</td>
    <td>B</td>
    <td>Whatever</td>
    <td><b>Done</b></td>
    <td>29.02.2020</td>
  </tr>

  <tr>
    <td>7</td>
    <td>D</td>
    <td>Now what?</td>
    <td>Open</td>
    <td>02.03.2020</td>
  </tr>

  <tr>
    <td>8</td>
    <td>C</td>
    <td>Almost done...</td>
    <td>Open</td>
    <td>02.03.2020</td>
  </tr>
</table>

What I want to achieve is a Power Query M code to display only the latest entry from each unfinished Task-ID.

Here is what I have so far:

let
    Source = ResponseData //a copy of my actual data query "ResponseData" to work with in this case
    #"Transformed Table" = Table.Distinct(Table.Buffer(Table.Sort(Source, {{"Date", Order.Descending}})), "Task-ID"),
    #"Remove Finished Tasks" = Table.SelectRows(#"Transformed Table", each ([Status] <> "Done" and [Status] <> "Canceled"))
in
    #"Remove Finished Tasks"

This code actually returns a list of all open tasks. But against my expectation it only contains the oldest response-ID of each task instead of the newest one. Changing the sort-order in the #"Transformed Table"-line returns the very same result in a reversed order. So basically sorting the table has no influence on the resulting data at all...

EDITED the code above according to the answer by Ron Rosenfeld -> Buffering the sorted table does the trick.

marc_s
  • 732,580
  • 175
  • 1,330
  • 1,459
GuidoT
  • 280
  • 1
  • 12

1 Answers1

2

For reasons I don't fully understand (seems to have something to do with query folding), Power Query Remove Duplicates when run on a certain data sources, apparently does not respect the sorting order. But if you use Table.Buffer, then it ensures that folding will not take place and that the Remove Duplicates will take place after the sorting/filtering.

Not sure which would be faster on large datasets, though -- Table.Buffer versus Table.Group

CSV File

Response-ID,Task-ID,Executed Action,Status,Date
1,A,Some Task to be done,Open,20.01.2020
2,A,Another Action done today,Open,25.01.2020
3,B,New Task,Open,25.01.2020
4,A,More actions in this matter,Done,26.01.2020
5,C,Another Task,Open,27.02.2020
6,B,Whatever,Done,29.02.2020
7,D,Now what?,Open,02.03.2020
8,C,Almost done...,Open,02.03.2020

First few lines of MCode is a bit different, but this should give you the gist:

let
    Source = Csv.Document(File.Contents("C:\Users\ron\Desktop\test.csv"),[Delimiter=",", Columns=5, Encoding=1252, QuoteStyle=QuoteStyle.None]),
    #"Promoted Headers" = Table.PromoteHeaders(Source, [PromoteAllScalars=true]),
    #"Changed Type" = Table.TransformColumnTypes(#"Promoted Headers",{{"Response-ID", Int64.Type}, {"Task-ID", type text}, {"Executed Action", type text}, {"Status", type text}, {"Date", type text}}),
    #"Changed Type with Locale" = Table.TransformColumnTypes(#"Changed Type", {{"Date", type date}}, "en-150"),
    #"Filtered Rows" = Table.SelectRows(#"Changed Type with Locale", each ([Status] = "Open")),
    #"Sorted Rows" = Table.Sort(#"Filtered Rows",{{"Date", Order.Descending}}),
    #"Buffered Table" = Table.Buffer(#"Sorted Rows"),
    #"Removed Duplicates" = Table.Distinct(#"Buffered Table", {"Task-ID"}),
    #"Sorted Rows1" = Table.Sort(#"Removed Duplicates",{{"Task-ID", Order.Ascending}})
in
    #"Sorted Rows1"

Results

enter image description here

Ron Rosenfeld
  • 53,870
  • 7
  • 28
  • 60
  • Yes! That's it! Never had I expected such a flaw in this feature... Buffering the sorted table does the trick. – GuidoT Mar 04 '20 at 07:35
  • Just out of curiosity as a beginner to Power Query M: How would the solution look like with grouping? I didn't think of grouping as a feature to just store data. Would it be faster? Over time there will be many entries (probabably way over 10.000) in my table. – GuidoT Mar 04 '20 at 07:45
  • @GuidoT To use grouping and also retain the column data that does not participate in the grouping gets pretty complex. I suggest you stick with the `Table.Buffer` method unless you find it quite slow on large data, at which time post a new question. If you don't need the non-participatory columns, then merely select the `Task-Id` column and group by on that, with **Max** as the operation on the `Date` column. – Ron Rosenfeld Mar 04 '20 at 12:06
  • @GuidoT One way of retaining the non-participatory columns is by duplicating the query, doing your filter/group on one, and then joining the appropriate rows. And I don't know which way would be faster. But if it comes to that, you can test both methods. – Ron Rosenfeld Mar 04 '20 at 13:05
  • @Darren Bartrup-Cook – Dreekun Mar 05 '20 at 12:46