
I've created a list of the files to loop through, but every looping construct I've tried (cut and pasted from code snippets) has failed. I want to subset each of the many .txt files in my working directory by a single column/variable, then merge the smaller results into one data frame.

  • Can you make a [minimal reproducible example](https://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example) of your problem? – markus Oct 28 '18 at 20:14
  • https://stackoverflow.com/help/how-to-ask – zx8754 Oct 28 '18 at 20:14
  • I'm a novice/beginner. I've downloaded a huge 2 GB file and had to break it up into 191 smaller files with 50,000 rows and 541 columns each; the data is tab-delimited. Each of those files contains only about 700-1000 rows of interest that I need to extract with col=="name" in subset(), and then I need to merge all of them back together. – T. Carson Oct 28 '18 at 20:22

2 Answers


Here is a complete, working solution using base R and the readr package. First, we download Alberto Barradas' Pokémon Stats data (originally from kaggle.com). After unzipping the archive, we read the extracted file names from disk and use lapply() with readr::read_csv() to load each file into memory and subset it on the Type1 column.

We then use do.call() with rbind to combine the subsetted data frames into a single data frame.

library(readr)

# download the zip archive and extract the data files
download.file("https://raw.githubusercontent.com/lgreski/pokemonData/master/pokemonData.zip",
              "pokemonData.zip",
              method = "curl", mode = "wb")
unzip("pokemonData.zip")

# collect the extracted file names, including their paths
thePokemonFiles <- list.files("./pokemonData", full.names = TRUE)
thePokemonFiles

# read each file and keep only the rows where Type1 is "Grass"
pokemonDataFiles <- lapply(thePokemonFiles, function(x) {
  y <- read_csv(x)
  y[y$Type1 == "Grass", ] # last expression is returned to lapply()
})

# combine the list of data frames into one
combined <- do.call(rbind, pokemonDataFiles)
head(combined)

...and the output:

> head(combined)
# A tibble: 6 x 13
  Number Name           Type1 Type2 Total    HP Attack Defense SpecialAtk
   <int> <chr>          <chr> <chr> <int> <int>  <int>   <int>      <int>
1      1 Bulbasaur      Grass Pois…   318    45     49      49         65
2      2 Ivysaur        Grass Pois…   405    60     62      63         80
3      3 Venusaur       Grass Pois…   525    80     82      83        100
4      3 VenusaurMega … Grass Pois…   625    80    100     123        122
5     43 Oddish         Grass Pois…   320    45     50      55         75
6     44 Gloom          Grass Pois…   395    60     65      70         85
# ... with 4 more variables: SpecialDef <int>, Speed <int>,
#   Generation <int>, Legendary <chr>
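
Since read_csv() returns tibbles, a minor variation (assuming the dplyr package is available) is to combine the list with dplyr::bind_rows(), which gives the same result here:

library(dplyr)
combined <- bind_rows(pokemonDataFiles) # equivalent to do.call(rbind, ...) for these files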

Note: the files in the question are tab-delimited; readr::read_csv() expects commas, so use readr::read_tsv() (or read_delim() with delim = "\t") for tab-separated files.
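
For the questioner's tab-delimited .txt files, the same pipeline would look roughly like this (a sketch: the directory, the column name col, and the value "name" are placeholders taken from the question):

library(readr)

# list the tab-delimited files in the working directory
theTxtFiles <- list.files(".", pattern = "\\.txt$", full.names = TRUE)

# read each file with read_tsv() and keep only the rows of interest
subsets <- lapply(theTxtFiles, function(x) {
  y <- read_tsv(x)     # read_tsv() expects tab-separated input
  y[y$col == "name", ] # placeholder column and value from the question
})
combined <- do.call(rbind, subsets)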

Len Greski

Without any dummy data or knowledge of what you have tried so far: assuming that your data is coherent and that the files are the only contents of one directory, you could use the following snippet:

install.packages("data.table")
library(data.table)

# full.names = TRUE so fread() receives usable paths
fileList <- list.files("/path/to/files/", full.names = TRUE)

for (i in seq_along(fileList)) {
  DF <- fread(fileList[i])
  interestingDF <- DF[DF$col == "name"] # keep only the wanted rows
  fwrite(interestingDF, file = "/path/to/new_file.txt", append = TRUE)
}

finalDF <- fread("/path/to/new_file.txt")

So, you list all the files and process them one at a time in a for loop. You read each file, extract the wanted rows with DF[DF$col == "name"], and append them to a new file. This saves a lot of memory, because you don't have to keep growing the combined data frame inside the loop. Once all the interesting data is stored in the new file, you simply read it back in.
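
If all of the subsets do fit in memory at once, a sketch of an alternative that skips the intermediate file and binds everything in one step with rbindlist() (again with col and "name" as placeholders from the question):

library(data.table)

fileList <- list.files("/path/to/files/", full.names = TRUE)

# read, filter, and bind all subsets in memory
finalDF <- rbindlist(lapply(fileList, function(f) {
  DF <- fread(f)
  DF[DF$col == "name"] # keep only the wanted rows
}))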

As a side note, fread() and fwrite() are easy-to-use and fast tools; I highly recommend them.

annhak