The code in this question and the datasets used in the code can be found in my GitHub Repository for this project.

I have loaded a set of 20 individual CSV-formatted datasets into R using the following code:

# these 2 lines together create a simple character list of 
# all the file names in the file folder of datasets you created
directory_paths <- "~/GMU folders (local)/DAEN_698/other datasets/sample obs(20 csvs)"
filepaths_list <- list.files(path = directory_paths, full.names = TRUE, recursive = TRUE)
> head(filepaths_list, 4)
[1] "C:/Users/Spencer/Documents/GMU folders (local)/DAEN_698/other datasets/sample obs(20 csvs)/0-3-1-1.csv" 
[2] "C:/Users/Spencer/Documents/GMU folders (local)/DAEN_698/other datasets/sample obs(20 csvs)/0-3-1-10.csv"
[3] "C:/Users/Spencer/Documents/GMU folders (local)/DAEN_698/other datasets/sample obs(20 csvs)/0-3-1-11.csv"
[4] "C:/Users/Spencer/Documents/GMU folders (local)/DAEN_698/other datasets/sample obs(20 csvs)/0-3-1-12.csv"

While the actual file folder full of CSVs, on the other hand, looks like this: [screenshot of the folder]

As you can see, they are out of order. R sorts the paths as text, character by character, so only the leading digit of the last of the 4 descriptors decides the order: "0-3-1-10" lands before "0-3-1-2".

How can I reorder their names to be arranged correctly as in the attached photo?

It was also suggested to me on here before that I use this:

# strip the directory and the extension from each csv file name
DS_names_list <- basename(filepaths_list)
DS_names_list <- tools::file_path_sans_ext(DS_names_list)
> DS_names_list
 [1] "0-3-1-1"  "0-3-1-10" "0-3-1-11" "0-3-1-12" "0-3-1-13" "0-3-1-14" "0-3-1-15" "0-3-1-16"
 [9] "0-3-1-17" "0-3-1-18" "0-3-1-19" "0-3-1-2"  "0-3-1-20" "0-3-1-3"  "0-3-1-4"  "0-3-1-5" 
[17] "0-3-1-6"  "0-3-1-7"  "0-3-1-8"  "0-3-1-9"

But altering this vector of names does not reorder or sort the actual list of file paths.

P.S. Each of the numbers in a filename indicates a property of the underlying regression model which truly characterizes the data within that csv file. The names all have the form n1-n2-n3-n4, where:

n1 indicates the degree of multicollinearity among all regressors (can be 0, 0.4 or 1)
n2 indicates the true number of regressors in the underlying structural model
n3 indicates the error variance for the true model
n4 indicates which random variation (from 1 to 500) this is among the possible datasets that could be created with the preceding 3 characteristics

Marlen
    Any of the lexicographic sorting methods [in the answers to this FAQ](https://stackoverflow.com/q/12806128/903061) should work for this. – Gregor Thomas Dec 21 '22 at 00:48

2 Answers

Okay, I'm going to try to simplify this down so it's clear and concise. Minimal reproducible examples are much quicker and easier to answer than lengthy questions with github links and screenshots.

As far as I can tell, your problem is this: You have data like this:

## nicely copy/pasteable sample data
## demonstrates the problem
## omits unneeded details
sample_data = c(
  "C:/path/0-3-1-1.csv", 
  "C:/path/0-3-1-10.csv",
  "C:/path/0-3-1-2.csv"
)

And you want to be able to sort it by the numeric components separated by dashes, treated numerically not alphabetically, so the desired result is

desired_result = c(
  "C:/path/0-3-1-1.csv", 
  "C:/path/0-3-1-2.csv",
  "C:/path/0-3-1-10.csv"
)
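To see why the default order differs from the desired one, here is a minimal sketch comparing text and numeric sorting on just the last component. The vector `last_parts` is a made-up illustration, and `method = "radix"` is used only so the text comparison does not depend on the locale.

```r
# Last dash-separated component of each sample file name (illustrative values)
last_parts <- c("1", "10", "2")

# Compared as text: "10" sorts before "2" because "1" < "2"
as_text <- sort(last_parts, method = "radix")
as_text
# [1] "1"  "10" "2"

# Compared as numbers: 2 sorts before 10
as_numbers <- sort(as.numeric(last_parts))
as_numbers
# [1]  1  2 10
```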

Here's an approach:

# extract the file names (as you have already done)
filenames = sample_data |> basename() |> tools::file_path_sans_ext()


my_order = filenames |> 
  # split apart the numbers
  strsplit(split = "-", fixed = TRUE) |>
  unlist() |> 
  # convert them to numeric and get them in a data frame
  as.numeric() |> 
  matrix(nrow = length(filenames), byrow = TRUE) |>
  as.data.frame() |>
  # get the appropriate ordering to sort the data frame
  do.call(order, args = _)

my_order
# [1] 1 3 2

sample_data[my_order]
# [1] "C:/path/0-3-1-1.csv"  "C:/path/0-3-1-2.csv"  "C:/path/0-3-1-10.csv"

The my_order result gives the indices to rearrange the original data to the desired result. You can use it on the sample_data or on just the extracted file names.
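As a sketch of how that index can be reused, the steps above can be wrapped in a small helper and applied to the full paths, so the paths and the names derived from them stay in the same order. The function name `numeric_name_order` and the shortened paths are hypothetical, and the helper assumes every file name has the same number of dash-separated numeric components.

```r
# Hypothetical helper (not part of the original answer): returns the ordering
# index for paths whose base names are dash-separated numbers.
numeric_name_order <- function(paths) {
  parts <- paths |>
    basename() |>
    tools::file_path_sans_ext() |>
    strsplit(split = "-", fixed = TRUE)
  # one row per file, one numeric column per dash-separated component
  mat <- do.call(rbind, lapply(parts, as.numeric))
  do.call(order, as.data.frame(mat))
}

# Shortened stand-ins for the real paths, in the order list.files() returns them
filepaths_list <- c(
  "C:/path/0-3-1-1.csv",
  "C:/path/0-3-1-10.csv",
  "C:/path/0-3-1-11.csv",
  "C:/path/0-3-1-2.csv",
  "C:/path/0-3-1-20.csv",
  "C:/path/0-3-1-3.csv"
)

ord <- numeric_name_order(filepaths_list)

# Reorder the paths themselves, then derive the names from the sorted paths
sorted_paths <- filepaths_list[ord]
sorted_names <- tools::file_path_sans_ext(basename(sorted_paths))
sorted_names
# [1] "0-3-1-1"  "0-3-1-2"  "0-3-1-3"  "0-3-1-10" "0-3-1-11" "0-3-1-20"
```

Computing the index once and subsetting the path vector with it avoids keeping two separately sorted vectors in sync.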

Another solution is to use the gtools::mixedorder() or gtools::mixedsort() functions. Confusingly, when I tried them out on the sample data they gave the reverse order. Then I realized that the gtools functions interpret your - separators as negative signs. So to use that tool, we would need to replace - with a different character:

sample_data |> 
  gsub(pattern = "-", replacement = "|", fixed = TRUE) |>
  gtools::mixedorder()
# [1] 1 3 2
## same ordering result as above
Gregor Thomas
  • Nevermind Gregor, I take all of my previous 3 reply comments back, I got it to work by swapping DS_names_list for filepaths_list in your suggested code, thanks a million good sir! – Marlen Dec 21 '22 at 09:16
Another approach, using essentially the same logic as @Gregor's: split the components out, and then pass all of them as a list of inputs to the order function.

ord <- do.call(order,
         strcapture("(\\d+)-(\\d+)-(\\d+)-(\\d+)", 
                   basename(sample_data), proto=list(1L,1L,1L,1L)))
sample_data[ord]
#[1] "C:/path/0-3-1-1.csv"  "C:/path/0-3-1-2.csv"  "C:/path/0-3-1-10.csv"
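The question mentions that the first component can be 0, 0.4 or 1, and the `\\d+` pattern above does not match a decimal point. A possible variant, assuming such files would be named like "0.4-3-1-2.csv" (an assumption, since no such name appears in the question), allows a dot in the first group and parses it as numeric. The `paths` vector below is made up for illustration.

```r
# Hypothetical file names; the decimal first component is an assumption
# based on the question's description of n1 (0, 0.4 or 1).
paths <- c(
  "C:/path/1-3-1-2.csv",
  "C:/path/0.4-3-1-10.csv",
  "C:/path/0.4-3-1-2.csv",
  "C:/path/0-3-1-1.csv"
)

# Capture the four components; the first may contain a decimal point.
# The zero-row data frame tells strcapture() which type each column gets.
parts <- strcapture(
  "^([0-9.]+)-([0-9]+)-([0-9]+)-([0-9]+)$",
  tools::file_path_sans_ext(basename(paths)),
  proto = data.frame(n1 = numeric(), n2 = integer(),
                     n3 = integer(), n4 = integer())
)

# Order by n1, then n2, n3 and n4, all compared numerically
ord <- do.call(order, parts)
paths[ord]
# [1] "C:/path/0-3-1-1.csv"    "C:/path/0.4-3-1-2.csv"  "C:/path/0.4-3-1-10.csv"
# [4] "C:/path/1-3-1-2.csv"
```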
thelatemail