I'm reading an accounting file for cities. My goal is to offer some informative subtotals for each accounting number of each establishment :
Some columns, named from (cumulSD3, cumulSC3) to (cumulSD7, cumulSC7) are added to the records, and aggregates soldeDebiteur and soldeCrediteur for root accounts : account number 13248
will aggregate under 13248
, 1324
and 132
levels, in example.
+--------------------------+----------+-----------------+---------------------+---------------------+---------+----------+------------+-----------+------------+----------+---------------------+-----------+------------+------------------+-------------------+------------------------+-------------------------+---------------------------+----------------------------+-----------------------------+------------------------------+-------------+--------------+-------------+---------------+--------------------------+--------+--------+-----------------------------------------------------------------------------------------------------+-------------------------+------------+----------------+----------------+----------+----------+----------------+----------+----------+----------------+----------+---------+---------------+-----------+--------------+----------------+--------+---------+
|libelleBudget |typeBudget|typeEtablissement|sousTypeEtablissement|nomenclatureComptable|siren |codeRegion|codeActivite|codeSecteur|numeroFINESS|codeBudget|categorieCollectivite|typeBalance|numeroCompte|balanceEntreeDebit|balanceEntreeCredit|operationBudgetaireDebit|operationBudgetaireCredit|operationNonBudgetaireDebit|operationNonBudgetaireCredit|operationOrdreBudgetaireDebit|operationOrdreBudgetaireCredit|soldeDebiteur|soldeCrediteur|anneeExercice|budgetPrincipal|nombreChiffresNumeroCompte|cumulSD7|cumulSC7|libelleCompte |nomenclatureComptablePlan|sirenCommune|populationTotale|numeroCompteSur3|cumulSD3 |cumulSC3 |numeroCompteSur4|cumulSD4 |cumulSC4 |numeroCompteSur5|cumulSD5 |cumulSC5 |codeDepartement|codeCommune|siret |numeroCompteSur6|cumulSD6|cumulSC6 |
+--------------------------+----------+-----------------+---------------------+---------------------+---------+----------+------------+-----------+------------+----------+---------------------+-----------+------------+------------------+-------------------+------------------------+-------------------------+---------------------------+----------------------------+-----------------------------+------------------------------+-------------+--------------+-------------+---------------+--------------------------+--------+--------+-----------------------------------------------------------------------------------------------------+-------------------------+------------+----------------+----------------+----------+----------+----------------+----------+----------+----------------+----------+---------+---------------+-----------+--------------+----------------+--------+---------+
|ABERGEMENT-CLEMENCIAT (L')|1 |101 |00 |M14 |210100012|084 |40 |null |null |null |Commune |DEF |1021 |0.0 |349139.71 |0.0 |0.0 |0.0 |0.0 |0.0 |0.0 |0.0 |349139.71 |2019 |true |4 |0.0 |0.0 |Dotation |M14 |210100012 |794 |102 |0.0 |995427.19 |1021 |0.0 |349139.71 |1021 |0.0 |0.0 |01 |01001 |21010001200017|1021 |0.0 |0.0 |
|ABERGEMENT-CLEMENCIAT (L')|1 |101 |00 |M14 |210100012|084 |40 |null |null |null |Commune |DEF |10222 |0.0 |554545.85 |0.0 |30003.0 |0.0 |0.0 |0.0 |0.0 |0.0 |584548.85 |2019 |true |5 |0.0 |0.0 |F.C.T.V.A. |M14 |210100012 |794 |102 |0.0 |995427.19 |1022 |0.0 |646287.48 |10222 |0.0 |584548.85|01 |01001 |21010001200017|10222 |0.0 |0.0 |
|ABERGEMENT-CLEMENCIAT (L')|1 |101 |00 |M14 |210100012|084 |40 |null |null |null |Commune |DEF |10223 |0.0 |4946.0 |0.0 |0.0 |0.0 |0.0 |0.0 |0.0 |0.0 |4946.0 |2019 |true |5 |0.0 |0.0 |T.L.E. |M14 |210100012 |794 |102 |0.0 |995427.19 |1022 |0.0 |646287.48 |10223 |0.0 |4946.0 |01 |01001 |21010001200017|10223 |0.0 |0.0 |
|ABERGEMENT-CLEMENCIAT (L')|1 |101 |00 |M14 |210100012|084 |40 |null |null |null |Commune |DEF |10226 |0.0 |41753.65 |0.0 |12078.54 |0.0 |0.0 |0.0 |0.0 |0.0 |53832.19 |2019 |true |5 |0.0 |0.0 |Taxe d’aménagement |M14 |210100012 |794 |102 |0.0 |995427.19 |1022 |0.0 |646287.48 |10226 |0.0 |53832.19 |01 |01001 |21010001200017|10226 |0.0 |0.0 |
|ABERGEMENT-CLEMENCIAT (L')|1 |101 |00 |M14 |210100012|084 |40 |null |null |null |Commune |DEF |10227 |0.0 |2960.44 |0.0 |0.0 |0.0 |0.0 |0.0 |0.0 |0.0 |2960.44 |2019 |true |5 |0.0 |0.0 |Versement pour sous-densité |M14 |210100012 |794 |102 |0.0 |995427.19 |1022 |0.0 |646287.48 |10227 |0.0 |2960.44 |01 |01001 |21010001200017|10227 |0.0 |0.0 |
|ABERGEMENT-CLEMENCIAT (L')|1 |101 |00 |M14 |210100012|084 |40 |null |null |null |Commune |DEF |1068 |0.0 |2281475.34 |0.0 |0.0 |0.0 |0.0 |0.0 |0.0 |0.0 |2281475.34 |2019 |true |4 |0.0 |0.0 |Excédents de fonctionnement capitalisés |M14 |210100012 |794 |106 |0.0 |2281475.34|1068 |0.0 |2281475.34|1068 |0.0 |0.0 |01 |01001 |21010001200017|1068 |0.0 |0.0 |
|ABERGEMENT-CLEMENCIAT (L')|1 |101 |00 |M14 |210100012|084 |40 |null |null |null |Commune |DEF |110 |0.0 |97772.73 |0.0 |0.0 |0.0 |112620.66 |0.0 |0.0 |0.0 |210393.39 |2019 |true |3 |0.0 |0.0 |Report à nouveau (solde créditeur) |M14 |210100012 |794 |110 |0.0 |210393.39 |110 |0.0 |0.0 |110 |0.0 |0.0 |01 |01001 |21010001200017|110 |0.0 |0.0 |
|ABERGEMENT-CLEMENCIAT (L')|1 |101 |00 |M14 |210100012|084 |40 |null |null |null |Commune |DEF |12 |0.0 |112620.66 |0.0 |0.0 |112620.66 |0.0 |0.0 |0.0 |0.0 |0.0 |2019 |true |2 |0.0 |0.0 |RÉSULTAT DE L'EXERCICE (excédentaire ou déficitaire) |M14 |210100012 |794 |12 |0.0 |0.0 |12 |0.0 |0.0 |12 |0.0 |0.0 |01 |01001 |21010001200017|12 |0.0 |0.0 |
|ABERGEMENT-CLEMENCIAT (L')|1 |101 |00 |M14 |210100012|084 |40 |null |null |null |Commune |DEF |1321 |0.0 |29097.78 |0.0 |0.0 |0.0 |0.0 |0.0 |0.0 |0.0 |29097.78 |2019 |true |4 |0.0 |0.0 |État et établissements nationaux |M14 |210100012 |794 |132 |0.0 |296722.26 |1321 |0.0 |29097.78 |1321 |0.0 |0.0 |01 |01001 |21010001200017|1321 |0.0 |0.0 |
|ABERGEMENT-CLEMENCIAT (L')|1 |101 |00 |M14 |210100012|084 |40 |null |null |null |Commune |DEF |1322 |0.0 |201.67 |0.0 |0.0 |0.0 |0.0 |0.0 |0.0 |0.0 |201.67 |2019 |true |4 |0.0 |0.0 |Régions |M14 |210100012 |794 |132 |0.0 |296722.26 |1322 |0.0 |201.67 |1322 |0.0 |0.0 |01 |01001 |21010001200017|1322 |0.0 |0.0 |
|ABERGEMENT-CLEMENCIAT (L')|1 |101 |00 |M14 |210100012|084 |40 |null |null |null |Commune |DEF |1323 |0.0 |163194.37 |0.0 |0.0 |0.0 |0.0 |0.0 |0.0 |0.0 |163194.37 |2019 |true |4 |0.0 |0.0 |Départements |M14 |210100012 |794 |132 |0.0 |296722.26 |1323 |0.0 |163194.37 |1323 |0.0 |0.0 |01 |01001 |21010001200017|1323 |0.0 |0.0 |
|ABERGEMENT-CLEMENCIAT (L')|1 |101 |00 |M14 |210100012|084 |40 |null |null |null |Commune |DEF |13248 |0.0 |1129.37 |0.0 |0.0 |0.0 |0.0 |0.0 |0.0 |0.0 |1129.37 |2019 |true |5 |0.0 |0.0 |Autres communes |M14 |210100012 |794 |132 |0.0 |296722.26 |1324 |0.0 |1129.37 |13248 |0.0 |1129.37 |01 |01001 |21010001200017|13248 |0.0 |0.0 |
|ABERGEMENT-CLEMENCIAT (L')|1 |101 |00 |M14 |210100012|084 |40 |null |null |null |Commune |DEF |13251 |0.0 |47079.11 |0.0 |2387.05 |0.0 |0.0 |0.0 |0.0 |0.0 |49466.16 |2019 |true |5 |0.0 |0.0 |GFP de rattachement |M14 |210100012 |794 |132 |0.0 |296722.26 |1325 |0.0 |49532.16 |13251 |0.0 |49466.16 |01 |01001 |21010001200017|13251 |0.0 |0.0 |
|ABERGEMENT-CLEMENCIAT (L')|1 |101 |00 |M14 |210100012|084 |40 |null |null |null |Commune |DEF |13258 |0.0 |66.0 |0.0 |0.0 |0.0 |0.0 |0.0 |0.0 |0.0 |66.0 |2019 |true |5 |0.0 |0.0 |Autres groupements |M14 |210100012 |794 |132 |0.0 |296722.26 |1325 |0.0 |49532.16 |13258 |0.0 |66.0 |01 |01001 |21010001200017|13258 |0.0 |0.0 |
To be clearer, retaining only the main fields involved in calculations, here's what my function focus on :
+--------------+------------+-------------+--------------+--------+--------+--------+--------+---------+---------+----------+----------+----------+----------+
| siret|numeroCompte|soldeDebiteur|soldeCrediteur|cumulSD7|cumulSC7|cumulSD6|cumulSC6| cumulSD5| cumulSC5| cumulSD4| cumulSC4| cumulSD3| cumulSC3|
+--------------+------------+-------------+--------------+--------+--------+--------+--------+---------+---------+----------+----------+----------+----------+
|21010001200017| 1021| 0.0| 349139.71| 0.0| 0.0| 0.0| 0.0| 0.0| 0.0| 0.0| 349139.71| 0.0| 995427.19|
|21010001200017| 10222| 0.0| 584548.85| 0.0| 0.0| 0.0| 0.0| 0.0|584548.85| 0.0| 646287.48| 0.0| 995427.19|
|21010001200017| 10223| 0.0| 4946.0| 0.0| 0.0| 0.0| 0.0| 0.0| 4946.0| 0.0| 646287.48| 0.0| 995427.19|
|21010001200017| 10226| 0.0| 53832.19| 0.0| 0.0| 0.0| 0.0| 0.0| 53832.19| 0.0| 646287.48| 0.0| 995427.19|
|21010001200017| 10227| 0.0| 2960.44| 0.0| 0.0| 0.0| 0.0| 0.0| 2960.44| 0.0| 646287.48| 0.0| 995427.19|
|21010001200017| 1068| 0.0| 2281475.34| 0.0| 0.0| 0.0| 0.0| 0.0| 0.0| 0.0|2281475.34| 0.0|2281475.34|
|21010001200017| 110| 0.0| 210393.39| 0.0| 0.0| 0.0| 0.0| 0.0| 0.0| 0.0| 0.0| 0.0| 210393.39|
|21010001200017| 12| 0.0| 0.0| 0.0| 0.0| 0.0| 0.0| 0.0| 0.0| 0.0| 0.0| 0.0| 0.0|
|21010001200017| 1321| 0.0| 29097.78| 0.0| 0.0| 0.0| 0.0| 0.0| 0.0| 0.0| 29097.78| 0.0| 296722.26|
|21010001200017| 1322| 0.0| 201.67| 0.0| 0.0| 0.0| 0.0| 0.0| 0.0| 0.0| 201.67| 0.0| 296722.26|
|21010001200017| 1323| 0.0| 163194.37| 0.0| 0.0| 0.0| 0.0| 0.0| 0.0| 0.0| 163194.37| 0.0| 296722.26|
|21010001200017| 13248| 0.0| 1129.37| 0.0| 0.0| 0.0| 0.0| 0.0| 1129.37| 0.0| 1129.37| 0.0| 296722.26|
|21010001200017| 13251| 0.0| 49466.16| 0.0| 0.0| 0.0| 0.0| 0.0| 49466.16| 0.0| 49532.16| 0.0| 296722.26|
|21010001200017| 13258| 0.0| 66.0| 0.0| 0.0| 0.0| 0.0| 0.0| 66.0| 0.0| 49532.16| 0.0| 296722.26|
|21010001200017| 1328| 0.0| 53566.91| 0.0| 0.0| 0.0| 0.0| 0.0| 0.0| 0.0| 53566.91| 0.0| 296722.26|
|21010001200017| 1341| 0.0| 142734.21| 0.0| 0.0| 0.0| 0.0| 0.0| 0.0| 0.0| 142734.21| 0.0| 145233.21|
|21010001200017| 1342| 0.0| 2499.0| 0.0| 0.0| 0.0| 0.0| 0.0| 0.0| 0.0| 2499.0| 0.0| 145233.21|
|21010001200017| 1383| 0.0| 2550.01| 0.0| 0.0| 0.0| 0.0| 0.0| 0.0| 0.0| 2550.01| 0.0| 2550.01|
|21010001200017| 1641| 0.0| 236052.94| 0.0| 0.0| 0.0| 0.0| 0.0| 0.0| 0.0| 236052.94| 0.0| 236052.94|
This starts on an accounting file sorted by department, city code, account number, siret (our identifier for establishments).
However, lacking of knowledge, I'm doing something that breaks the heart :
First attempt that is costly but works, through RDD
/**
* Créer un dataset de cumuls de comptes parents par siret.
* @param session Session Spark.
* @param comptes Dataset des comptes de comptabilités de tous les siret.
* @return Dataset avec un siret associés à des cumuls par comptes à 7, 6, 5, 4, 3 chiffres, pour soldes de débit et soldes de crédit.
*/
private Dataset<Row> cumulsComptesParentsParSiret(SparkSession session, Dataset<Row> comptes) {
JavaPairRDD<String, Iterable<Row>> rddComptesParSiret = comptes.javaRDD().groupBy((Function<Row, String>)compte -> compte.getAs("siret"));
// Réaliser les cumuls par siret et compte, par compte parent.
JavaRDD<Row> rdd = rddComptesParSiret.flatMap((FlatMapFunction<Tuple2<String, Iterable<Row>>, Row>)comptesSiret -> {
String siret = comptesSiret._1();
AccumulateurCompte comptesParentsPourSiret = new AccumulateurCompte(siret);
for(Row rowCompte : comptesSiret._2()) {
String numeroCompte = rowCompte.getAs("numeroCompte");
Double soldeSD = rowCompte.getAs("soldeDebiteur");
Double soldeSC = rowCompte.getAs("soldeCrediteur");
comptesParentsPourSiret.add(numeroCompte, soldeSD, soldeSC);
}
// Faire une ligne de regroupement siret, compte et ses comptes parents.
List<Row> rowsCumulsPourSiret = new ArrayList<>();
for(Row rowCompte : comptesSiret._2()) {
String numeroCompte = rowCompte.getAs("numeroCompte");
double sd[] = new double[6];
double sc[] = new double[6];
for(int nombreChiffres = numeroCompte.length(); nombreChiffres >= 3; nombreChiffres--) {
String compteParent = numeroCompte.substring(0, nombreChiffres);
Double cumulDebits = comptesParentsPourSiret.getCumulSD(compteParent);
Double cumulCredits = comptesParentsPourSiret.getCumulSC(compteParent);
sd[nombreChiffres - 3] = cumulDebits != null ? Precision.round(cumulDebits, 2, BigDecimal.ROUND_CEILING) : 0.0;
sc[nombreChiffres - 3] = cumulCredits != null ? Precision.round(cumulCredits, 2, BigDecimal.ROUND_CEILING) : 0.0;
}
Row rowCumulsPourCompte = RowFactory.create(siret, numeroCompte, sd[4], sc[4], sd[3], sc[3], sd[2], sc[2], sd[1], sc[1], sd[0], sc[0]);
rowsCumulsPourSiret.add(rowCumulsPourCompte);
}
return rowsCumulsPourSiret.iterator();
});
return session.createDataFrame(rdd, schemaCumulComptesParents());
}
Second attempt, through datasets : efficient, but doesn't allow low-level management of accounting records
/**
* Cumuler les sous-comptes.
* @param comptes Dataset de comptes.
* @return Dataset aux cumuls de comptes à 3, 4, 5, 6, 7 chiffres réalisés, par commune.
*/
private Dataset<Row> cumulsSousComptes(Dataset<Row> comptes) {
Dataset<Row> comptesAvecCumuls = comptes;
for(int nombreChiffresNiveauCompte = 3; nombreChiffresNiveauCompte < 7; nombreChiffresNiveauCompte ++) {
comptesAvecCumuls = cumulsCompteParent(comptesAvecCumuls, nombreChiffresNiveauCompte);
}
return comptesAvecCumuls;
}
/**
* Cumul par un niveau de compte parent.
* @param comptes Liste des comptes.
* @param nombreChiffres Nombre de chiffres auquel réduire le compte à cummuler. Exemple 4 : 2041582 est cumulé sur 2041.
* @return cumuls par compte parent : dataset au format (cumul des soldes débiteurs, cumul des soldes créditeurs).
*/
private Dataset<Row> cumulsCompteParent(Dataset<Row> comptes, int nombreChiffres) {
// Cumuler pour un niveau de compte parent sur le préfixe de leurs comptes réduits à nombreChiffres.
Column nombreChiffresCompte = comptes.col("nombreChiffresNumeroCompte");
String aliasNumeroCompte = MessageFormat.format("numeroCompteSur{0}", nombreChiffres);
RelationalGroupedDataset group = comptes.groupBy(col("codeDepartement"), col("codeCommune"), col("siret"), col("numeroCompte").substr(1,nombreChiffres).as(aliasNumeroCompte));
String nomChampCumulSD = MessageFormat.format("cumulSD{0}", nombreChiffres);
String nomChampCumulSC = MessageFormat.format("cumulSC{0}", nombreChiffres);
Column sd = sum(when(nombreChiffresCompte.$greater$eq(lit(nombreChiffres)), col("soldeDebiteur")).otherwise(lit(0.0))).as(nomChampCumulSD);
Column sc = sum(when(nombreChiffresCompte.$greater$eq(lit(nombreChiffres)), col("soldeCrediteur")).otherwise(lit(0.0))).as(nomChampCumulSC);
Dataset<Row> cumuls = group.agg(sd, sc);
// Associer à chaque compte la colonne de cumuls de comptes parents, pour le niveau en question.
Column jointure =
comptes.col("codeDepartement").equalTo(cumuls.col("codeDepartement"))
.and(comptes.col("codeCommune").equalTo(cumuls.col("codeCommune")))
.and(comptes.col("siret").equalTo(cumuls.col("siret")))
.and(comptes.col("numeroCompte").substr(1, nombreChiffres).equalTo(cumuls.col(aliasNumeroCompte)));
Dataset<Row> comptesAvecCumuls = comptes.join(cumuls, jointure, "left_outer")
.drop(comptes.col("siret"))
.drop(comptes.col("codeDepartement"))
.drop(comptes.col("codeCommune"))
.drop(comptes.col(nomChampCumulSD))
.drop(comptes.col(nomChampCumulSC))
.withColumnRenamed("cumulSD", nomChampCumulSD)
.withColumnRenamed("cumulSC", nomChampCumulSC)
.withColumn(nomChampCumulSD, round(col(nomChampCumulSD), 2))
.withColumn(nomChampCumulSC, round(col(nomChampCumulSC), 2));
return comptesAvecCumuls;
}
By low level management, I mean : some last minute verifications to emit some warnings or exclude at summation time some values :
- If accounting nomenclature changes for one record among those of establishment.
- To warn about a value that seems strange, regarding to other knowledge I have.
What I would like to have
I need to browse the rows content of each group independently. One group after the other.
I would need a Spark function that would offer me to implement a call-back method, where :
- the first parameter would give the current keys values (for department code, city code, siret),
- and the second one, the rows associated with these keys.
Dataset<Row> eachGroupContent(Row keys, Dataset<Row> groupContent);
It would be successively called by Spark with these entries parameters :
Row (keys) : {Department : 01, City code : 01001, siret : 21010001200017}
Dataset<Row> (values) associated :
+---------------+-----------+--------------+------------+-------------+--------------+--------+
|codeDepartement|codeCommune| siret|numeroCompte|soldeDebiteur|soldeCrediteur|(others)|
+---------------+-----------+--------------+------------+-------------+--------------+--------+
| 01| 01001|21010001200017| 1021| 0.0| 349139.71| ...|
| 01| 01001|21010001200017| 10222| 0.0| 584548.85| ...|
| 01| 01001|21010001200017| 10223| 0.0| 4946.0| ...|
| 01| 01001|21010001200017| 10226| 0.0| 53832.19| ...|
Row : {Department : 01, City code : 01001, siret : 21010001200033}
Dataset<Row> :
| 01| 01001|21010001200033| 1021| 0.0| 38863.22| ...|
| 01| 01001|21010001200033| 10222| 0.0| 62067.0| ...|
| 01| 01001|21010001200033| 10228| 0.0| 9666.0| ...|
| 01| 01001|21010001200033| 1068| 0.0| 100121.62| ...|
Row : {Department : 01, City code : 01001, siret : 21010001200066}
Dataset<Row> :
| 01| 01001|21010001200066| 1641| 0.0| 100000.0| ...|
| 01| 01001|21010001200066| 3355| 587689.33| 0.0| ...|
| 01| 01001|21010001200066| 4011| 0.0| 0.0| ...|
| 01| 01001|21010001200066| 40171| 0.0| 10036.5| ...|
It's what was my first attempt was somewhat able to do,
rddComptesParSiret.flatMap((FlatMapFunction<Tuple2<String, Iterable<Row>>, Row>)comptesSiret
but without providing all the good keys (department and city code were missing breaking all the sorting previously done), and also : RDD
are no more in favor.
But that I wasn't able to achieve in Java through RelationalGroupedDataset
methods that don't seem to offer such tool.
Currently, I know how to do a groupBy or a sort, that way :
accounting.groupBy("department", "cityCode", "accountNumber", "siret").agg(...);
My question
How to browse
each record of
each group
[to perform sub calculations or other work]
group after group