I've created a list of the files to loop through, but every looping construct I've tried (cut and pasted from code snippets) has failed. I want to subset the many .txt files in my working directory by one single column/variable, then merge the smaller files together into one data frame.
-
Can you make a [minimal reproducible example](https://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example) of your problem? – markus Oct 28 '18 at 20:14
-
https://stackoverflow.com/help/how-to-ask – zx8754 Oct 28 '18 at 20:14
-
I'm a novice/beginner. I downloaded a huge 2 GB file and had to break it up into 191 smaller tab-delimited files with 50,000 rows and 541 columns each. Each file contains only about 700-1000 rows of interest that I need to extract with col=="name" in subset(), and then I need to merge all of them back together. – T. Carson Oct 28 '18 at 20:22
2 Answers
Here is a complete, working solution using base R and the readr package. First, we download Alberto Barradas' Pokémon Stats data (originally from kaggle.com). After unzipping the data files, we read their file names from disk and use lapply() with readr::read_csv() to load them into memory, subsetting each on the Type1 column. We then use do.call() to combine the files into a single data frame.
download.file("https://raw.githubusercontent.com/lgreski/pokemonData/master/pokemonData.zip",
              "pokemonData.zip",
              method = "curl", mode = "wb")
unzip("pokemonData.zip")
thePokemonFiles <- list.files("./pokemonData", full.names = TRUE)
thePokemonFiles

library(readr)
pokemonDataFiles <- lapply(thePokemonFiles, function(x) {
    y <- read_csv(x)
    y[y$Type1 == "Grass", ] # return the subsetted data frame to the result object
})
combined <- do.call(rbind, pokemonDataFiles)
head(combined)
...and the output:
> head(combined)
# A tibble: 6 x 13
Number Name Type1 Type2 Total HP Attack Defense SpecialAtk
<int> <chr> <chr> <chr> <int> <int> <int> <int> <int>
1 1 Bulbasaur Grass Pois… 318 45 49 49 65
2 2 Ivysaur Grass Pois… 405 60 62 63 80
3 3 Venusaur Grass Pois… 525 80 82 83 100
4 3 VenusaurMega … Grass Pois… 625 80 100 123 122
5 43 Oddish Grass Pois… 320 45 50 55 75
6 44 Gloom Grass Pois… 395 60 65 70 85
# ... with 4 more variables: SpecialDef <int>, Speed <int>,
# Generation <int>, Legendary <chr>
>
Note: readr::read_csv() expects comma-separated data. For the tab-delimited files described in the question, use readr::read_tsv() (or readr::read_delim() with delim = "\t") instead.
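Applied to the question's tab-delimited files, the same lapply()/do.call() pattern looks like the following sketch. The column name "name" and the value "keep" are placeholders, and two small temporary files stand in for the 191 real ones:

```r
library(readr)

# Create two small tab-separated demo files in a temp directory
# (placeholders for the question's actual .txt files).
dir <- file.path(tempdir(), "tsvDemo")
dir.create(dir, showWarnings = FALSE)
write_tsv(data.frame(name = c("keep", "drop"), x = 1:2), file.path(dir, "f1.txt"))
write_tsv(data.frame(name = c("keep", "keep"), x = 3:4), file.path(dir, "f2.txt"))

# Read each file with read_tsv(), keep only the rows of interest,
# then combine the per-file subsets into one data frame.
theFiles <- list.files(dir, pattern = "\\.txt$", full.names = TRUE)
subsetList <- lapply(theFiles, function(f) {
    y <- read_tsv(f, show_col_types = FALSE)
    y[y$name == "keep", ]
})
combined <- do.call(rbind, subsetList)
combined
```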

Without any dummy data or knowledge of what you have tried so far: assuming that your data is consistent and the files are the only contents of one directory, you could use the following snippet:
install.packages("data.table")
library(data.table)

fileList <- list.files("/path/to/files/", full.names = TRUE)
for (i in seq_along(fileList)) {
    DF <- fread(fileList[i])
    interestingDF <- DF[DF$col == "name"]
    fwrite(interestingDF, file = "/path/to/new_file.txt", append = TRUE)
}
finalDF <- fread("/path/to/new_file.txt")
So, you list all the files and process them one at a time in a for loop. You read each file, extract the wanted rows with DF[DF$col == "name"], and append them to a new file. This saves a lot of memory, because you don't have to keep growing the combined data frame inside the loop. Once the interesting data is stored in the new file, you just read it in.
As a side note, fread and fwrite are fast, easy-to-use tools; I highly recommend them.
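If the filtered subsets do fit in memory, a common data.table alternative is to collect them in a list and combine once with rbindlist() instead of appending to a file. A self-contained sketch (the column name col, the value "name", and the demo files are all placeholders):

```r
library(data.table)

# Create two small tab-delimited demo files in a temp directory
# (placeholders for the question's 191 real files).
dir <- file.path(tempdir(), "txtFiles")
dir.create(dir, showWarnings = FALSE)
fwrite(data.table(col = c("name", "other"), value = 1:2),
       file.path(dir, "a.txt"), sep = "\t")
fwrite(data.table(col = c("name", "name"), value = 3:4),
       file.path(dir, "b.txt"), sep = "\t")

# Read each file, keep only the rows of interest,
# then combine all subsets in one rbindlist() call.
fileList <- list.files(dir, full.names = TRUE)
subsets <- lapply(fileList, function(f) {
    DF <- fread(f)
    DF[DF$col == "name"]
})
finalDF <- rbindlist(subsets)
finalDF
```

rbindlist() is generally faster than repeated rbind() because it allocates the result once rather than copying the accumulated rows on every iteration.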
