Opening and reshaping xlsx files with nameless columns in r using a pattern

Question

I'm working with French electoral data but I'm having issues opening xlsx files to work on them in r. I was wondering if anyone had had the same problem and found a solution.

The issue is that only the first 29 columns out of +100 columns have names and the rest are nameless. I've tried editing the column names in excel before opening them but this solution is time consuming and prone to mistakes. I'm looking for a way to automatize the process.

The datasets have a pattern that I'm trying to exploit to rename the columns and reshape the files:

the first 6 columns correspond to the geographic id of the precinct (region, municipality, etc...)
the next 15 columns give information about aggregate results in the precinct (number of voters, number of registered voters, participation, etc..)
The next 8 columns give information about a given candidate and her results in the precinct (name, sex, party id, number of votes, .. etc)

These 29 columns have names.

The next columns are nameless and correspond to other candidates. They repeat the 8 columns for the other candidates.

There is another layer of difficulty since each precinct does not have the same number of candidates so the number of nameless columns changes.

Ideally, I would want r to recognize the pattern and reshape the datasets to long by creating a new row for each candidate keeping the precinct id and aggregate data in each row. To do this, I would like r to recognize each sequence of nameless 8 columns.

To simplify, let's say that my data frame looks like the following:

precinct_id	tot_votes	candidate_id	candidate_votes	...1	...2
Paris 05	1000	Jean Dupont	400	Paul Dupuy	300
Paris 06	500	Jean Dupont	50	Paul Dupuy	150

where:

candidate_id and candidate_votes correspond to the id and result of the first candidate
...1, ...2 is how r is automatically renaming the nameless columns that correspond to candidate_id and candidate_votes for candidate 2 in the same precinct.

I need r to select the observations in each sequence of 2 columns and paste them into new rows under candidate_id candidate_votes while keeping the precinct_id and precinct_votes columns.

precinct_id	tot_votes	candidate_id	candidate_votes
Paris 05	1000	Jean Dupont	400
Paris 06	500	Jean Dupont	50
Paris 05	1000	Paul Dupuy	300
Paris 06	500	Paul Dupuy	150

I have no idea how to reshape without column names... Any help would be greatly appreciated! Thanks!

PS: The files come from here: https://www.data.gouv.fr/fr/datasets/elections-legislatives-des-12-et-19-juin-2022-resultats-definitifs-du-premier-tour/

You've not told us how you're trying to read the Excel files, but here's an extract from the online doc for `readxl::read_excel()`: "col_names TRUE to use the first row as column names, FALSE to get default names, or a character vector giving a name for each column. If user provides col_types as a vector, col_names can have one entry per column, i.e. have the same length as col_types, or one entry per unskipped column". — Limey, Aug 08 '22 at 14:23
Hi, I'm using the readxl package. `df <- read_excel("filename.xlsx", .name_repair = janitor::make_clean_names)` is helpful but I still need to paste the missing column names in excel first which is still tedious. — LLS, Aug 08 '22 at 16:35
"I still need to paste the missing column names in excel first". I don't think you do. Skip the column headers with `skip=` and supply column names for all columns with `col_names=x` where `x` is a character vector as described in the extract from the online doc I quoted above. "which is ... tedious". Awww. Bless. — Limey, Aug 08 '22 at 16:59
Thank you for taking the time to answer. I'm not sure how to apply the vector solution if there are +100 missing column names. Right now, I'm trying to find some way to label the unnamed columns efficiently based on the columns (3:4 in the example I gave above) for which I have correct names (id and votes for candidate 1) so that I can use pivot_longer on the dataframe and create a new row for each candidate in the precinct using names_pattern. I'm not sure how to apply a character vector to do this. — LLS, Aug 08 '22 at 19:05

Limey · Accepted Answer · 2022-08-08T22:48:59.477

Actually, there's an even simpler solution to the one I suggested. .names_repair can take a function as its value. This function should accept a vector of "input" column names and return a vector of "output column names". As we want to treat the data for the first candidate in each row in eactly the same way as every subsequent set of eight columns, I'll ignore only the first 21 columns, not the first 29.

read_excel(
  "resultats-par-niveau-subcom-t1-france-entiere.xlsx",
  .name_repair=function(x) {
    suffixes <- c("NPanneau", "Sexe", "Nom", "Prénom", "Nuance", "Voix", "PctVoixIns", "PctVoixExp")
    if ((length(x) - 21) %% 8 != 0) stop(paste("Don't know how to handle a sheet with", length(x), "columns [", (length(x) - 21) %% 8, "]"))
    for (i in 1:length(x)) {
      if (i > 21) {
        x[i] <- paste0("C", 1 + floor((i-22)/8), "_", suffixes[1 + (i-22) %% 8])
      }
    }
    x
  }
)
# A tibble: 35,429 × 197                                                                                                                       
   `Code du département` `Libellé du dép…` `Code de la ci…` `Libellé de la…` `Code de la co…` `Libellé de la…` `Etat saisie` Inscrits Abstentions
   <chr>                 <chr>             <chr>            <chr>            <chr>            <chr>            <chr>            <dbl>       <dbl>
 1 01                    Ain               01               1ère circonscri… 016              Arbigny          Complet            327         154
 2 01                    Ain               01               1ère circonscri… 024              Attignat         Complet           2454        1281
 3 01                    Ain               01               1ère circonscri… 029              Beaupont         Complet            446         224
 4 01                    Ain               01               1ère circonscri… 038              Bény             Complet            604         306
 5 01                    Ain               01               1ère circonscri… 040              Béréziat         Complet            362         179
 6 01                    Ain               01               1ère circonscri… 050              Boissey          Complet            262         137
 7 01                    Ain               01               1ère circonscri… 053              Bourg-en-Bresse  Complet          15516        8426
 8 01                    Ain               01               1ère circonscri… 057              Boz              Complet            391         210
 9 01                    Ain               01               1ère circonscri… 065              Buellas          Complet           1408         654
10 01                    Ain               01               1ère circonscri… 069              Certines         Complet           1169         639
# … with 35,419 more rows, and 188 more variables: `% Abs/Ins` <dbl>, Votants <dbl>, `% Vot/Ins` <dbl>, Blancs <dbl>, `% Blancs/Ins` <dbl>,
#   `% Blancs/Vot` <dbl>, Nuls <dbl>, `% Nuls/Ins` <dbl>, `% Nuls/Vot` <dbl>, Exprimés <dbl>, `% Exp/Ins` <dbl>, `% Exp/Vot` <dbl>,
#   C1_NPanneau <dbl>, C1_Sexe <chr>, C1_Nom <chr>, C1_Prénom <chr>, C1_Nuance <chr>, C1_Voix <dbl>, C1_PctVoixIns <dbl>, C1_PctVoixExp <dbl>,
#   C2_NPanneau <dbl>, C2_Sexe <chr>, C2_Nom <chr>, C2_Prénom <chr>, C2_Nuance <chr>, C2_Voix <dbl>, C2_PctVoixIns <dbl>, C2_PctVoixExp <dbl>,
#   C3_NPanneau <dbl>, C3_Sexe <chr>, C3_Nom <chr>, C3_Prénom <chr>, C3_Nuance <chr>, C3_Voix <dbl>, C3_PctVoixIns <dbl>, C3_PctVoixExp <dbl>,
#   C4_NPanneau <dbl>, C4_Sexe <chr>, C4_Nom <chr>, C4_Prénom <chr>, C4_Nuance <chr>, C4_Voix <dbl>, C4_PctVoixIns <dbl>, C4_PctVoixExp <dbl>,
#   C5_NPanneau <dbl>, C5_Sexe <chr>, C5_Nom <chr>, C5_Prénom <chr>, C5_Nuance <chr>, C5_Voix <dbl>, C5_PctVoixIns <dbl>, C5_PctVoixExp <dbl>, …

That's read the data in and named the columns. To get the final format you want, we will need to do a standard pivot_longer()/pivot_wider() trick, but the situation here is slightly complicated because some of your columns are character and some are numeric. So first, I'll turn the numeric columns into character columns so that the pivot_longer() step doesn't fail.

For clarity, I'll drop the first 21 columns so that it's easy to see what's going on.

read_excel(
  "resultats-par-niveau-subcom-t1-france-entiere.xlsx",
  .name_repair=function(x) {
    suffixes <- c("NPanneau", "Sexe", "Nom", "Prénom", "Nuance", "Voix", "PctVoixIns", "PctVoixExp")
    if ((length(x) - 21) %% 8 != 0) stop(paste("Don't know how to handle a sheet with", length(x), "columns [", (length(x) - 21) %% 8, "]"))
    for (i in 1:length(x)) {
      if (i > 21) {
        x[i] <- paste0("C", 1 + floor((i-22)/8), "_", suffixes[1 + (i-22) %% 8])
      }
    }
    x
  }
)  %>% 
mutate(across(where(is.numeric) | where(is.logical), as.character)) %>% 
pivot_longer(!1:21, names_sep="_", names_to=c("Candidate", "Variable"), values_to="Value") %>% 
select(!1:21)
# A tibble: 6,235,504 × 3                                                                                                                      
   Candidate Variable   Value
   <chr>     <chr>      <chr>
 1 C1        NPanneau   2    
 2 C1        Sexe       M    
 3 C1        Nom        LAHY 
 4 C1        Prénom     Éric 
 5 C1        Nuance     DXG  
 6 C1        Voix       2    
 7 C1        PctVoixIns 0.61 
 8 C1        PctVoixExp 1.23 
 9 C2        NPanneau   8    
10 C2        Sexe       M    
# … with 6,235,494 more rows

Now add the pivot_wider(), again dropping the first 21 columns, purely for clarity.

read_excel(
  "resultats-par-niveau-subcom-t1-france-entiere.xlsx",
  .name_repair=function(x) {
    suffixes <- c("NPanneau", "Sexe", "Nom", "Prénom", "Nuance", "Voix", "PctVoixIns", "PctVoixExp")
    if ((length(x) - 21) %% 8 != 0) stop(paste("Don't know how to handle a sheet with", length(x), "columns [", (length(x) - 21) %% 8, "]"))
    for (i in 1:length(x)) {
      if (i > 21) {
        x[i] <- paste0("C", 1 + floor((i-22)/8), "_", suffixes[1 + (i-22) %% 8])
      }
    }
    x
  }
)  %>% 
mutate(across(where(is.numeric) | where(is.logical), as.character)) %>% 
pivot_longer(!1:21, names_sep="_", names_to=c("Candidate", "Variable"), values_to="Value") %>% 
pivot_wider(names_from=Variable, values_from=Value) %>% 
select(!1:21)
# A tibble: 779,438 × 9                                                                                                                        
   Candidate NPanneau Sexe  Nom              Prénom    Nuance Voix  PctVoixIns PctVoixExp
   <chr>     <chr>    <chr> <chr>            <chr>     <chr>  <chr> <chr>      <chr>     
 1 C1        2        M     LAHY             Éric      DXG    2     0.61       1.23      
 2 C2        8        M     GUÉRAUD          Sébastien NUP    26    7.95       15.95     
 3 C3        7        F     ARMENJON         Eliane    ECO    3     0.92       1.84      
 4 C4        1        M     GUILLERMIN       Vincent   ENS    30    9.17       18.4      
 5 C5        3        M     BRETON           Xavier    LR     44    13.46      26.99     
 6 C6        5        M     MENDES           Michael   DSV    3     0.92       1.84      
 7 C7        6        M     BELLON           Julien    REC    6     1.83       3.68      
 8 C8        4        F     PIROUX GIANNOTTI Brigitte  RN     49    14.98      30.06     
 9 C9        NA       NA    NA               NA        NA     NA    NA         NA        
10 C10       NA       NA    NA               NA        NA     NA    NA         NA        
# … with 779,428 more rows

Finally, convert the "temporary character" columns back to numeric. (Still dropping the first 21 columns for clarity.)

read_excel(
  "resultats-par-niveau-subcom-t1-france-entiere.xlsx",
  .name_repair=function(x) {
    suffixes <- c("NPanneau", "Sexe", "Nom", "Prénom", "Nuance", "Voix", "PctVoixIns", "PctVoixExp")
    if ((length(x) - 21) %% 8 != 0) stop(paste("Don't know how to handle a sheet with", length(x), "columns [", (length(x) - 21) %% 8, "]"))
    for (i in 1:length(x)) {
      if (i > 21) {
        x[i] <- paste0("C", 1 + floor((i-22)/8), "_", suffixes[1 + (i-22) %% 8])
      }
    }
    x
  }
)  %>% 
mutate(across(where(is.numeric) | where(is.logical), as.character)) %>% 
pivot_longer(!1:21, names_sep="_", names_to=c("Candidate", "Variable"), values_to="Value") %>% 
pivot_wider(names_from=Variable, values_from=Value) %>% 
mutate(across(c(Voix, PctVoixIns, PctVoixExp), as.numeric)) %>% 
select(!1:21)
# A tibble: 779,438 × 9                                                                                                                        
   Candidate NPanneau Sexe  Nom              Prénom    Nuance  Voix PctVoixIns PctVoixExp
   <chr>     <chr>    <chr> <chr>            <chr>     <chr>  <dbl>      <dbl>      <dbl>
 1 C1        2        M     LAHY             Éric      DXG        2       0.61       1.23
 2 C2        8        M     GUÉRAUD          Sébastien NUP       26       7.95      16.0 
 3 C3        7        F     ARMENJON         Eliane    ECO        3       0.92       1.84
 4 C4        1        M     GUILLERMIN       Vincent   ENS       30       9.17      18.4 
 5 C5        3        M     BRETON           Xavier    LR        44      13.5       27.0 
 6 C6        5        M     MENDES           Michael   DSV        3       0.92       1.84
 7 C7        6        M     BELLON           Julien    REC        6       1.83       3.68
 8 C8        4        F     PIROUX GIANNOTTI Brigitte  RN        49      15.0       30.1 
 9 C9        NA       NA    NA               NA        NA        NA      NA         NA   
10 C10       NA       NA    NA               NA        NA        NA      NA         NA   
# … with 779,428 more rows

This, I think, is the format you want, though you may need to arrange() the rows into the order you want. Obviously, you should drop the final %>% select(!1:21) for your production version.

It is an easy matter to convert this code to a function that accepts a filename as its parameter and then use this in an lapply to read an entire folder into a list of data frames. However...

It appears that not every file in the folder has the same layout. resultats-par-niveau-fe-t1-outre-mer.xlsx, for example, appears to have fewer "prefix columns" before the 8-columns-per-candidate repeat begins.
The import generates several warnings. This appears to be because the election(?) with the largest number of candidates does not appear in the first rows of the worksheet. I've not investigated whether these warnings are generated by meaningful problems with the import.

Opening and reshaping xlsx files with nameless columns in r using a pattern

1 Answers1