
I have a directory with over 1,600 .txt files. An unknown number of those files have blank entries in column 22. This is causing R to stop running a different set of code.

Is there a way to write code to have R scan all of the .txt files in the specified directory for blank entries in any row in column 22? Each .txt file has 3600 rows.

In addition, what code would tell R to return the names of all of the files where this condition is met so I can go through the directory and remove them?

Thank you very much!!

Gavia_immer
  • could you please provide the first 23 lines of one .txt with the issue in line 22 (this way a possible answer can be tested to make sure it really helps you). Also, the function you are using to read in the .txt would help - maybe there is a faster/more reliable function, or a way to solve this without prior scanning/manipulation of the files – DPH Jul 02 '21 at 21:43
  • The issue isn't necessarily in line 22, it's always in column 22. For instance, the issue in the .txt file I'm looking at is around line 3400, where there is a blank in column 22. I am thinking I can use which(is.na()) to do this, but I'm not sure how to loop it through all of the files. – Gavia_immer Jul 02 '21 at 21:52
  • please provide a few lines of your original data with one missing case, and which function you use to read in the data. If the error happens in a function call after the reading function, all lines of your processing would help. Normally, random missing values (in one or multiple columns) should not cause an error when reading. In a processing loop you can use next to jump (for example, when detecting one NA in column 22) – DPH Jul 02 '21 at 22:05
  • I understand where you're coming from with the 'next' solution, and that may be the better route in some cases. But since I'm so new to coding, I'm really trying to learn some basics. I think at this point I'm most interested in identifying the files with blanks in the column titled 'dbA'. I normally post example data, but in this case it doesn't seem appropriate since I don't know where to begin to read in all of the files and then scan for blanks in column 22. I think a function with which(is.na(file$dbA)) is probably going to be involved (see the sketch after these comments). – Gavia_immer Jul 02 '21 at 22:23
  • I posted an example of the next usage, scanning for a missing value in a specific column - you can use this to solve your specific need: a loop that jumps if all values in the specific column are not NA (instead of != 0 you would use == 0), and that in all other cases deletes the file just read in (you can also just collect the file names with this type of loop) – DPH Jul 02 '21 at 22:27
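
A minimal sketch of the approach discussed in these comments, assuming the files are tab-delimited with a header row, the column of interest is named dbA, and blank entries come in as NA (adjust the read.delim() call to match your files):

file.list <- list.files(pattern = "\\.txt$")  # all .txt files in the working directory
bad.files <- character(0)                     # collect the names of files with blanks here

for (f in file.list){
    df <- read.delim(f)                       # assumption: tab-delimited with a header row
    if (any(is.na(df$dbA))) {                 # same test as length(which(is.na(df$dbA))) > 0
        bad.files <- c(bad.files, f)          # remember this file's name
    }
}

bad.files  # the files to go through and remove from the directory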

1 Answer


You can use "next" in you processing loop to jump when detecting a file with at least one missing value in a specific column

# example input is a list of two dataframes, one with missing value in column 2
dd <- list(data.frame(x = 1:2, y = c(1, NA)),
           data.frame(x = 1:2, y = 1:2))

# loop through the list and print list item/data.frame (you get both data.frames printed)
for (i in 1:length(dd)){
    print(dd[[i]])
}

# now the same loop, but with a jump in case column 2 has any NA value (only the second data.frame is printed)
for (i in 1:length(dd)){
    # if any value in column 2 is NA then jump loop
    if (sum(is.na(dd[[i]][,2])) != 0) next
    print(dd[[i]])
}

It might be better to impute the missing values, depending on what you are trying to achieve, or it might even be possible to adapt your processing to handle the possibly missing line, etc. In any case, instead of deleting, just jumping might be better. You can also alter the code to delete the file just read in once a missing value in the specific column is detected (instead of "next" you would use a function to remove the file).
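
For example, a rough sketch of the deletion variant (this assumes read.table() reads your files correctly and that you really do want the offending files gone; file.remove() deletes permanently, so collecting the names first, as in the sketch above, is the safer option):

file.list <- list.files(pattern = "\\.txt$")

for (i in 1:length(file.list)){
    df <- read.table(file.list[i])   # assumption: read.table() works for your files
    if (sum(is.na(df[, 22])) != 0){
        file.remove(file.list[i])    # delete the file with a blank in column 22 ...
        next                         # ... and skip any further processing of it
    }
    # ... normal processing of df goes here ...
}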

Adapted according to list.files():

for (i in 1:length(file.list)){
  df <- read.table(file.list[i]) # not sure which function you are using
  if (ncol(df) != 54) next # make sure the file has 54 columns
  if (sum(is.na(df[, 22])) != 0) next # jump in case of at least 1 NA in column 22
  print(df)
}
DPH
  • To read in all files I use: file.list <- list.files(). I've modified your code to look like: for (i in 1:length(file.list)){ if(sum(is.na(file.list[[i]][,22])) != 0) next print(file.list[[i]]) }. But when I run it I get the following message: Error in file.list[[i]][, 22] : incorrect number of dimensions. Any thoughts? – Gavia_immer Jul 02 '21 at 22:40
  • I added the correction of this code in my answer (makes it easier to read and copy than posting it as a comment) – DPH Jul 02 '21 at 22:49
  • After switching your code from read.table to read.delim I get this error: Error in `[.data.frame`(df, , 22) : undefined columns selected. – Gavia_immer Jul 02 '21 at 22:51
  • I fear this means that there are some files with fewer than 22 columns... are you sure all files have the same number of columns? (I will be offline for a few minutes but will get back asap) – DPH Jul 02 '21 at 22:53
  • Yes, all of my .txt files have 54 columns and 3600 rows. Thank you for your continued patience and help. – Gavia_immer Jul 02 '21 at 22:57
  • I altered the code with a check for 54 columns, jumping if not, before checking the content of column 22 - please try to run this – DPH Jul 02 '21 at 23:14
  • This is the error I get: Error in read.table(file.list[i]) : duplicate 'row.names' are not allowed. – Gavia_immer Jul 02 '21 at 23:16
  • have a look here for this error: https://stackoverflow.com/questions/8854046/duplicate-row-names-are-not-allowed-error or use your read.delim instead if it works better for you... a faster approach is the fread() function from the data.table package (a short sketch follows these comments) – DPH Jul 02 '21 at 23:18
  • It ran, but instead of doing what we wanted, for 'df' it returned 2634 rows for the last text file in the directory. – Gavia_immer Jul 02 '21 at 23:25
  • Let us [continue this discussion in chat](https://chat.stackoverflow.com/rooms/234462/discussion-between-dph-and-gavia-immer). – DPH Jul 02 '21 at 23:30
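
As a follow-up to the fread() suggestion in the comments, a minimal sketch of the same check using data.table::fread(), which is usually faster for 1,600 files and sidesteps the duplicate 'row.names' error; this assumes the files are regular delimited text whose separator and header fread() can auto-detect:

library(data.table)

file.list <- list.files(pattern = "\\.txt$")
files.with.na <- character(0)

for (f in file.list){
    dt <- fread(f)               # fread() auto-detects the separator and header
    if (ncol(dt) != 54) next     # skip files without the expected 54 columns
    if (anyNA(dt[[22]])){        # TRUE if column 22 contains at least one NA
        files.with.na <- c(files.with.na, f)
    }
}

files.with.na  # names of the files with a blank in column 22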