I have a CSV file like this:

C, Comment1
C, Comment2
H, col_1, col_2, col_3
H, num, char, char
D, 1, a, b
D, 2, c, d
D, 3, e, f
D, 4, g, h
D, 5, i, j
F, 5 lines

How can I pre-process this CSV file before importing it into R? I want to skip the lines that do not start with "D", use the third row as the header, and drop the first column.

The imported data frame should look something like this:

col_1, col_2, col_3
1, a, b
2, c, d
3, e, f
4, g, h
5, i, j
Shahab Einabadi

1 Answer

You can load the file as plain text using readLines(), which stores each line as a string in a character vector. You can then inspect the lines and work out which filtering rules fit your problem best.

Here is a code chunk that may help you:

# load environment
library(stringr)

# define the data path
data_path = '~/Downloads/file.csv'
# load data as a character vector
data = readLines(data_path)
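# remove the first column (the one-letter record tag and the ", " after it)
data = str_remove(data, '^., ')
# detect and keep lines having 3 columns (2 commas)
n_commas = str_count(data, ',')
data = data[n_commas == 2]
# drop the second header line, which holds the column types (num, char)
is_type_row = str_detect(data, 'num|char')
data = data[!is_type_row]
# overwrite the data (note: this replaces the original file on disk)
writeLines(data, data_path)

# now load the data as a data frame; strip.white drops the space after each comma
df = read.csv(data_path, strip.white = TRUE)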
# print output
print(df)

Here is the output:

  col_1 col_2 col_3
1     1     a     b
2     2     c     d
3     3     e     f
4     4     g     h
5     5     i     j

This solution is not very general, but I don't think you can avoid matching specific patterns to decide which lines to keep and which to remove.
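
If the first field always tags the record type, as in the sample file, a somewhat more general variant is to filter on that tag instead of counting commas. The following is only a sketch under assumptions: read_tagged_csv is a hypothetical helper name, the tag meanings (H for header, D for data) are inferred from the sample, and the first H line is assumed to hold the column names. It uses base R only and does not modify the original file.

```r
# Hypothetical helper (not from any package): read a file whose first
# field tags each record, e.g. C = comment, H = header, D = data, F = footer.
read_tagged_csv <- function(path) {
  lines <- readLines(path)
  # the record tag is everything before the first comma
  tag <- trimws(sub(",.*$", "", lines))
  # the record body is everything after the first comma
  body <- sub("^[^,]*,", "", lines)
  # assumption: the first H line holds the column names
  header <- body[tag == "H"][1]
  rows <- body[tag == "D"]
  # parse header + data rows from memory; strip.white removes the
  # spaces that follow each comma
  read.csv(text = c(header, rows), strip.white = TRUE)
}

# usage example on a temporary copy of the sample file
demo_path <- tempfile(fileext = ".csv")
writeLines(c(
  "C, Comment1",
  "C, Comment2",
  "H, col_1, col_2, col_3",
  "H, num, char, char",
  "D, 1, a, b",
  "D, 2, c, d",
  "D, 3, e, f",
  "D, 4, g, h",
  "D, 5, i, j",
  "F, 5 lines"
), demo_path)
df <- read_tagged_csv(demo_path)
print(df)
```

Because the filter looks only at the tag, comment and footer lines are dropped no matter how many commas they contain. The second H line (the column types) is simply ignored here.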

Let us know if it helped!

rodolfoksveiga
  • Thanks, it works with a little modification. I am wondering if there is any general code to pre-process the data. The first column has valuable information: if the line is a comment the first column is "C", if it is a header it is "H", and if it is real data it is "D". I think there should be a general way to filter the imported data based on the first column – Shahab Einabadi Oct 14 '20 at 01:05
  • I didn't know this type of file, thanks for sharing @ShahabEinabadi. Well, since you already know the patterns, you can build generalized code to clean this file format. Working with string vectors is easy, you'll get it fast! You should even share it with the community, it might be very useful for other people! – rodolfoksveiga Oct 14 '20 at 01:20
  • I found this question https://stackoverflow.com/questions/23197243/how-to-read-only-lines-that-fulfil-a-condition-from-a-csv-into-r; but it doesn't work in my case! I have some comment lines and two headers, read.csv.sql can't handle it, this function needs a clean table – Shahab Einabadi Oct 14 '20 at 14:06