Actually, there's an even simpler solution to the one I suggested. .names_repair
can take a function as its value. This function should accept a vector of "input" column names and return a vector of "output column names". As we want to treat the data for the first candidate in each row in eactly the same way as every subsequent set of eight columns, I'll ignore only the first 21 columns, not the first 29.
read_excel(
"resultats-par-niveau-subcom-t1-france-entiere.xlsx",
.name_repair=function(x) {
suffixes <- c("NPanneau", "Sexe", "Nom", "Prénom", "Nuance", "Voix", "PctVoixIns", "PctVoixExp")
if ((length(x) - 21) %% 8 != 0) stop(paste("Don't know how to handle a sheet with", length(x), "columns [", (length(x) - 21) %% 8, "]"))
for (i in 1:length(x)) {
if (i > 21) {
x[i] <- paste0("C", 1 + floor((i-22)/8), "_", suffixes[1 + (i-22) %% 8])
}
}
x
}
)
# A tibble: 35,429 × 197
`Code du département` `Libellé du dép…` `Code de la ci…` `Libellé de la…` `Code de la co…` `Libellé de la…` `Etat saisie` Inscrits Abstentions
<chr> <chr> <chr> <chr> <chr> <chr> <chr> <dbl> <dbl>
1 01 Ain 01 1ère circonscri… 016 Arbigny Complet 327 154
2 01 Ain 01 1ère circonscri… 024 Attignat Complet 2454 1281
3 01 Ain 01 1ère circonscri… 029 Beaupont Complet 446 224
4 01 Ain 01 1ère circonscri… 038 Bény Complet 604 306
5 01 Ain 01 1ère circonscri… 040 Béréziat Complet 362 179
6 01 Ain 01 1ère circonscri… 050 Boissey Complet 262 137
7 01 Ain 01 1ère circonscri… 053 Bourg-en-Bresse Complet 15516 8426
8 01 Ain 01 1ère circonscri… 057 Boz Complet 391 210
9 01 Ain 01 1ère circonscri… 065 Buellas Complet 1408 654
10 01 Ain 01 1ère circonscri… 069 Certines Complet 1169 639
# … with 35,419 more rows, and 188 more variables: `% Abs/Ins` <dbl>, Votants <dbl>, `% Vot/Ins` <dbl>, Blancs <dbl>, `% Blancs/Ins` <dbl>,
# `% Blancs/Vot` <dbl>, Nuls <dbl>, `% Nuls/Ins` <dbl>, `% Nuls/Vot` <dbl>, Exprimés <dbl>, `% Exp/Ins` <dbl>, `% Exp/Vot` <dbl>,
# C1_NPanneau <dbl>, C1_Sexe <chr>, C1_Nom <chr>, C1_Prénom <chr>, C1_Nuance <chr>, C1_Voix <dbl>, C1_PctVoixIns <dbl>, C1_PctVoixExp <dbl>,
# C2_NPanneau <dbl>, C2_Sexe <chr>, C2_Nom <chr>, C2_Prénom <chr>, C2_Nuance <chr>, C2_Voix <dbl>, C2_PctVoixIns <dbl>, C2_PctVoixExp <dbl>,
# C3_NPanneau <dbl>, C3_Sexe <chr>, C3_Nom <chr>, C3_Prénom <chr>, C3_Nuance <chr>, C3_Voix <dbl>, C3_PctVoixIns <dbl>, C3_PctVoixExp <dbl>,
# C4_NPanneau <dbl>, C4_Sexe <chr>, C4_Nom <chr>, C4_Prénom <chr>, C4_Nuance <chr>, C4_Voix <dbl>, C4_PctVoixIns <dbl>, C4_PctVoixExp <dbl>,
# C5_NPanneau <dbl>, C5_Sexe <chr>, C5_Nom <chr>, C5_Prénom <chr>, C5_Nuance <chr>, C5_Voix <dbl>, C5_PctVoixIns <dbl>, C5_PctVoixExp <dbl>, …
That's read the data in and named the columns. To get the final format you want, we will need to do a standard pivot_longer()/pivot_wider()
trick, but the situation here is slightly complicated because some of your columns are character and some are numeric. So first, I'll turn the numeric columns into character columns so that the pivot_longer()
step doesn't fail.
For clarity, I'll drop the first 21 columns so that it's easy to see what's going on.
read_excel(
"resultats-par-niveau-subcom-t1-france-entiere.xlsx",
.name_repair=function(x) {
suffixes <- c("NPanneau", "Sexe", "Nom", "Prénom", "Nuance", "Voix", "PctVoixIns", "PctVoixExp")
if ((length(x) - 21) %% 8 != 0) stop(paste("Don't know how to handle a sheet with", length(x), "columns [", (length(x) - 21) %% 8, "]"))
for (i in 1:length(x)) {
if (i > 21) {
x[i] <- paste0("C", 1 + floor((i-22)/8), "_", suffixes[1 + (i-22) %% 8])
}
}
x
}
) %>%
mutate(across(where(is.numeric) | where(is.logical), as.character)) %>%
pivot_longer(!1:21, names_sep="_", names_to=c("Candidate", "Variable"), values_to="Value") %>%
select(!1:21)
# A tibble: 6,235,504 × 3
Candidate Variable Value
<chr> <chr> <chr>
1 C1 NPanneau 2
2 C1 Sexe M
3 C1 Nom LAHY
4 C1 Prénom Éric
5 C1 Nuance DXG
6 C1 Voix 2
7 C1 PctVoixIns 0.61
8 C1 PctVoixExp 1.23
9 C2 NPanneau 8
10 C2 Sexe M
# … with 6,235,494 more rows
Now add the pivot_wider()
, again dropping the first 21 columns, purely for clarity.
read_excel(
"resultats-par-niveau-subcom-t1-france-entiere.xlsx",
.name_repair=function(x) {
suffixes <- c("NPanneau", "Sexe", "Nom", "Prénom", "Nuance", "Voix", "PctVoixIns", "PctVoixExp")
if ((length(x) - 21) %% 8 != 0) stop(paste("Don't know how to handle a sheet with", length(x), "columns [", (length(x) - 21) %% 8, "]"))
for (i in 1:length(x)) {
if (i > 21) {
x[i] <- paste0("C", 1 + floor((i-22)/8), "_", suffixes[1 + (i-22) %% 8])
}
}
x
}
) %>%
mutate(across(where(is.numeric) | where(is.logical), as.character)) %>%
pivot_longer(!1:21, names_sep="_", names_to=c("Candidate", "Variable"), values_to="Value") %>%
pivot_wider(names_from=Variable, values_from=Value) %>%
select(!1:21)
# A tibble: 779,438 × 9
Candidate NPanneau Sexe Nom Prénom Nuance Voix PctVoixIns PctVoixExp
<chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
1 C1 2 M LAHY Éric DXG 2 0.61 1.23
2 C2 8 M GUÉRAUD Sébastien NUP 26 7.95 15.95
3 C3 7 F ARMENJON Eliane ECO 3 0.92 1.84
4 C4 1 M GUILLERMIN Vincent ENS 30 9.17 18.4
5 C5 3 M BRETON Xavier LR 44 13.46 26.99
6 C6 5 M MENDES Michael DSV 3 0.92 1.84
7 C7 6 M BELLON Julien REC 6 1.83 3.68
8 C8 4 F PIROUX GIANNOTTI Brigitte RN 49 14.98 30.06
9 C9 NA NA NA NA NA NA NA NA
10 C10 NA NA NA NA NA NA NA NA
# … with 779,428 more rows
Finally, convert the "temporary character" columns back to numeric. (Still dropping the first 21 columns for clarity.)
read_excel(
"resultats-par-niveau-subcom-t1-france-entiere.xlsx",
.name_repair=function(x) {
suffixes <- c("NPanneau", "Sexe", "Nom", "Prénom", "Nuance", "Voix", "PctVoixIns", "PctVoixExp")
if ((length(x) - 21) %% 8 != 0) stop(paste("Don't know how to handle a sheet with", length(x), "columns [", (length(x) - 21) %% 8, "]"))
for (i in 1:length(x)) {
if (i > 21) {
x[i] <- paste0("C", 1 + floor((i-22)/8), "_", suffixes[1 + (i-22) %% 8])
}
}
x
}
) %>%
mutate(across(where(is.numeric) | where(is.logical), as.character)) %>%
pivot_longer(!1:21, names_sep="_", names_to=c("Candidate", "Variable"), values_to="Value") %>%
pivot_wider(names_from=Variable, values_from=Value) %>%
mutate(across(c(Voix, PctVoixIns, PctVoixExp), as.numeric)) %>%
select(!1:21)
# A tibble: 779,438 × 9
Candidate NPanneau Sexe Nom Prénom Nuance Voix PctVoixIns PctVoixExp
<chr> <chr> <chr> <chr> <chr> <chr> <dbl> <dbl> <dbl>
1 C1 2 M LAHY Éric DXG 2 0.61 1.23
2 C2 8 M GUÉRAUD Sébastien NUP 26 7.95 16.0
3 C3 7 F ARMENJON Eliane ECO 3 0.92 1.84
4 C4 1 M GUILLERMIN Vincent ENS 30 9.17 18.4
5 C5 3 M BRETON Xavier LR 44 13.5 27.0
6 C6 5 M MENDES Michael DSV 3 0.92 1.84
7 C7 6 M BELLON Julien REC 6 1.83 3.68
8 C8 4 F PIROUX GIANNOTTI Brigitte RN 49 15.0 30.1
9 C9 NA NA NA NA NA NA NA NA
10 C10 NA NA NA NA NA NA NA NA
# … with 779,428 more rows
This, I think, is the format you want, though you may need to arrange()
the rows into the order you want. Obviously, you should drop the final %>% select(!1:21)
for your production version.
It is an easy matter to convert this code to a function that accepts a filename as its parameter and then use this in an lapply
to read an entire folder into a list of data frames. However...
- It appears that not every file in the folder has the same layout. resultats-par-niveau-fe-t1-outre-mer.xlsx, for example, appears to have fewer "prefix columns" before the 8-columns-per-candidate repeat begins.
- The import generates several warnings. This appears to be because the election(?) with the largest number of candidates does not appear in the first rows of the worksheet. I've not investigated whether these warnings are generated by meaningful problems with the import.