
I'm trying to prepare a dataset to use as training data for a deep neural network. It consists of 13 .txt files, each between 500 MB and 2 GB. However, when I run a "data_prepare.py" file, I get the ValueError in this post's title.

Following answers to previous posts, I loaded my data into R and checked for both NaN and infinite values, but the checks suggest there is nothing wrong with my data. I have done the following:

  1. I load my data as a single data frame using the magrittr, data.table and purrr packages (there are about 300 million rows, each with 7 variables):
txt_fread <- 
  list.files(pattern="*.txt") %>%
  map_df(~fread(.))
  2. I have used sapply to check for finite and NaN values:
> any(sapply(txt_fread, is.finite))
[1] TRUE
> any(sapply(txt_fread, is.nan))
[1] FALSE

I have also tried loading each file into a data frame in a Jupyter notebook and checking each one individually for those values with the following commands:

file1 = pd.read_csv("File_name_xyz_intensity_rgb.txt", sep=" ", header=None)

np.any(np.isnan(file1))
False

np.all(np.isfinite(file1))
True
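
For reference, here is a minimal sketch of how the same check could be run over a whole file in chunks, so that it also reports which rows are affected and does not need the full file in memory. The function name `find_non_finite` and the chunk size are hypothetical choices made for illustration; the separator and `header=None` are copied from the `read_csv` call above. Text that cannot be parsed as a number is treated as NaN here, which is an assumption about what should count as a bad value.

```python
import numpy as np
import pandas as pd


def find_non_finite(path, chunksize=1_000_000):
    """Return the row numbers in `path` that contain NaN, +/-inf, or non-numeric text."""
    bad_rows = []
    # Read the file piece by piece; the row index keeps counting across chunks.
    reader = pd.read_csv(path, sep=" ", header=None, chunksize=chunksize)
    for chunk in reader:
        # Coerce every column to numeric; anything unparseable becomes NaN.
        numeric = chunk.apply(pd.to_numeric, errors="coerce")
        values = numeric.to_numpy(dtype="float64")
        # A row is flagged if at least one of its values is not finite.
        mask = ~np.isfinite(values).all(axis=1)
        bad_rows.extend(chunk.index[mask].tolist())
    return bad_rows
```

It could be called as `find_non_finite("File_name_xyz_intensity_rgb.txt")`; an empty list would mean that no row in that file was flagged.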

And when I use print(file1.info()), this is what I get as info:

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 22525176 entries, 0 to 22525175
Data columns (total 7 columns):
 #   Column  Dtype  
---  ------  -----  
 0   0       float64
 1   1       float64
 2   2       float64
 3   3       int64  
 4   4       int64  
 5   5       int64  
 6   6       int64  
dtypes: float64(3), int64(4)
memory usage: 1.2 GB
None
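
A value that `np.isfinite` reports as finite in a float64 column is, by definition, representable as float64, and any int64 value converts to a finite float64. One thing that may still be worth ruling out is a later cast to a narrower type. The sketch below assumes, without any confirmation from data_prepare.py, that the data might be cast to float32 somewhere downstream; the function name `columns_overflowing` is hypothetical.

```python
import numpy as np
import pandas as pd


def columns_overflowing(df, dtype=np.float32):
    """Return labels of columns holding finite values outside the range of `dtype`."""
    # Largest finite magnitude the target dtype can represent.
    limit = np.finfo(dtype).max
    values = df.to_numpy(dtype="float64")
    # Finite as float64, but would become +/-inf after casting to `dtype`.
    too_large = np.isfinite(values) & (np.abs(values) > limit)
    return [col for col, flag in zip(df.columns, too_large.any(axis=0)) if flag]
```

Calling `columns_overflowing(file1)` would return an empty list if every value fits into float32 under that assumption.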

I know the file containing the code (data_prepare.py) works because it runs properly with a similar dataset. So the problem must lie in the new data I describe here, but I don't know what I have missed or done wrong while checking for NaN and infinite values. I have also tried reading and checking the .txt files individually, but that hasn't helped much either.

Any help is really appreciated!!

Btw: the R code with map_df came from a post by leerssej in How to import multiple .csv files at once?

  • Why is this tagged R if your error apparently is thrown by python code? – Roland Apr 07 '20 at 08:08
  • Because I used R to check for those values in the dataset (in the first two points), but also tried checking for them in a jupyter notebook. – E. Andrade Velásquez Apr 07 '20 at 08:18
  • I still fail to see how this is a question about your R code. Anyway, you probably mean to use `any(!sapply(txt_fread, is.finite))` (note the `!`) or (better) `all(sapply(txt_fread, is.finite))`. – Roland Apr 07 '20 at 08:23
  • Ok, thanks. I used the second option ("all(sapply)"), but it returns TRUE for is.finite and FALSE for is.nan. I imagine what I still have to check is whether the values are "too large for dtype('float64')"? I've tried looking around, but I still don't understand how to do that. – E. Andrade Velásquez Apr 07 '20 at 08:32
  • I can only reiterate: I believe you have a Python question. And you need to provide a reproducible example or at least show the code that throws the error. – Roland Apr 07 '20 at 09:41
