38

I'm using R to visualize some data, all of which is in .txt format. There are a few hundred files in a directory and I want to load them all into one table, in one shot.

Any help?

EDIT:

Listing the files is not a problem, but I am having trouble going from the list to the content. I've tried some of the code from here, but I get an error with this part:

all.the.data <- lapply( all.the.files,  txt  , header=TRUE)

saying

 Error in match.fun(FUN) : object 'txt' not found

Any snippets of code that would clarify this problem would be greatly appreciated.

Tung
Eric Brotto
  • 1
    The problem is `txt` is not a function. The link you pointed to is about the `read.csv` function. – Wok Aug 03 '10 at 17:56

5 Answers

41

You can try this:

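# match only names that end in ".txt" (the dot is escaped, the end is anchored)
filelist = list.files(pattern = "\\.txt$")

# assuming whitespace-separated values with a header
# (add sep = "\t" if the files are strictly tab-separated)
datalist = lapply(filelist, function(x) read.table(x, header = TRUE))

# assuming the same header/columns for all files
datafr = do.call("rbind", datalist)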
Greg
35

There are three fast ways to read multiple files and put them into a single data frame or data table.

First, get the list of all txt files (including those in sub-folders):

list_of_files <- list.files(path = ".", recursive = TRUE,
                            pattern = "\\.txt$", 
                            full.names = TRUE)

1) Use fread() w/ rbindlist() from the data.table package

#install.packages("data.table", repos = "https://cran.rstudio.com")
library(data.table)

# Read all the files and create a FileName column to store filenames
DT <- rbindlist(sapply(list_of_files, fread, simplify = FALSE),
                use.names = TRUE, idcol = "FileName")

2) Use readr::read_table2() w/ purrr::map_df() from the tidyverse framework (in readr 2.0 and later, read_table2() is deprecated in favor of read_table()):

#install.packages("tidyverse", 
#                 dependencies = TRUE, repos = "https://cran.rstudio.com")
library(tidyverse)

# Read all the files and create a FileName column to store filenames
df <- list_of_files %>%
  set_names(.) %>%
  map_df(read_table2, .id = "FileName")

3) (Probably the fastest out of the three) Use vroom::vroom():

#install.packages("vroom", 
#                 dependencies = TRUE, repos = "https://cran.rstudio.com")
library(vroom)

# Read all the files and create a FileName column to store filenames
df <- vroom(list_of_files, id = "FileName")

   

Note: to clean up file names, use basename or gsub functions
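
As a minimal sketch of that clean-up, assuming paths shaped like the ones list.files(full.names = TRUE) returns (the example paths and the names `paths`, `file_names` and `clean_names` are made up for illustration):

```r
# Hypothetical paths, shaped like the output of list.files(full.names = TRUE)
paths <- c("./data/run_01.txt", "./data/sub/run_02.txt")

# basename() drops the directory part of each path
file_names <- basename(paths)

# gsub() with an anchored pattern removes the trailing ".txt" extension
clean_names <- gsub("\\.txt$", "", file_names)

print(clean_names)
```

The same two steps can be applied to the FileName column after reading, or to the names of the file list before reading.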

Benchmark: readr vs data.table vs vroom for big data


Edit 1: to read multiple csv files and skip the header using readr::read_csv

list_of_files <- list.files(path = ".", recursive = TRUE,
                            pattern = "\\.csv$", 
                            full.names = TRUE)

df <- list_of_files %>%
  purrr::set_names(nm = (basename(.) %>% tools::file_path_sans_ext())) %>%
  purrr::map_df(read_csv, 
                col_names = FALSE,
                skip = 1,
                .id = "FileName")

Edit 2: to convert a pattern including a wildcard into the equivalent regular expression, use glob2rx()
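
A short sketch of how that might look, assuming the wildcard pattern "*.txt" and the current directory as the search path (both chosen only for illustration):

```r
# glob2rx() ships with base R (package utils), so nothing needs installing
rx <- glob2rx("*.txt")

# Show the regular expression that the wildcard pattern was translated to
print(rx)

# The result can be passed straight to list.files() as its pattern
txt_files <- list.files(path = ".", pattern = rx, full.names = TRUE)
```

This avoids hand-escaping the dot and anchoring the pattern yourself.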

Tung
  • 1
    How can I select only first three variables/columns of the list_of_files? – mRiddle Apr 01 '18 at 17:05
  • 1
    If you use `fread`: use `select = c(1:3)` or `select = c("colname 1", "colname 2", "colname 3")`. If you use `read_table2`, check the argument `col_types = cols_only(colname1 = "i", colname2 = "d")` where `i` is integer and `d` is double. HTH – Tung Apr 01 '18 at 17:20
  • 1
    See my recent answer for more options for cleaning up filenames https://stackoverflow.com/a/49546846/786542 – Tung Apr 01 '18 at 17:43
  • Relevant to `readr` usage: https://stackoverflow.com/questions/50651898/what-are-permissible-column-objects-of-the-form-col-used-in-readr/50652089#50652089 – Tung Sep 19 '18 at 20:45
  • @Tung I am using your Edit 1 to merge several .csv files. The output is coming as row bind but I want the output as column bind. Any help in this regard is highly appreciated. – UseR10085 Jan 03 '20 at 11:08
  • 1
    gotta upvote someone using `data.table` – WestCoastProjects Feb 26 '20 at 05:27
  • 1
    @BappaDas: did you try `map_dfc()`? – Tung Nov 15 '20 at 05:56
11

There is a really, really easy way to do this now: the readtext package.

readtext::readtext("path_to/your_files/*.txt")

It really is that easy.

Ken Benoit
  • 1
    This is a nice function, but `readtext` will just import all of the text into a single column. In most cases there will be additional manipulation required after this to make the data usable. – EcologyTom Apr 16 '18 at 15:58
  • 1
    True, that's what the **quanteda** package is for. – Ken Benoit Apr 17 '18 at 11:50
5

Look at the help for the functions dir() aka list.files(). They let you get a list of files, possibly filtered by regular expressions, over which you could loop.

If you want to read them all at once, you first have to have the content in one file. One option would be to use cat to type all files to stdout and read that through a pipe() connection (R's counterpart to popen()). See help(connections) for more.
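
A rough sketch of the cat approach, under several assumptions: a Unix-like shell where cat is available, files that share the same columns and have no header line (a header repeated in every file would end up as data rows), and R's pipe() as the popen-style connection. The temporary directory and its two files are hypothetical and exist only to make the sketch runnable.

```r
# Create two small header-less example files in a temporary directory
# (hypothetical data, only so the sketch can be run as-is)
tmp <- file.path(tempdir(), "pipe_demo")
dir.create(tmp, showWarnings = FALSE)
writeLines(c("1 2", "3 4"), file.path(tmp, "a.txt"))
writeLines(c("5 6", "7 8"), file.path(tmp, "b.txt"))

# Build a shell command that concatenates every .txt file to stdout
files <- list.files(tmp, pattern = "\\.txt$", full.names = TRUE)
cmd <- paste("cat", paste(shQuote(files), collapse = " "))

# read.table() opens the pipe connection, reads it to the end, and closes it
combined <- read.table(pipe(cmd), header = FALSE)
print(dim(combined))
```

With several hundred files the command line can get long, so looping over the file list inside R may be the more robust choice.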

Dirk Eddelbuettel
5

Thanks for all the answers!

In the meantime, I also hacked together a method of my own. Let me know if it is of any use:

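setwd("/path/to/directory")

# list every file in the working directory
files <- list.files()

# start from an empty character vector so no stray element ends up in the result
data <- character(0)

for (f in files) {
  # read the whitespace-separated tokens of one file as character strings
  tempData <- scan(f, what = "character")

  # append them to everything read so far
  data <- c(data, tempData)
}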
Eric Brotto