How do I extract specific rows from a CSV and format the data in R?

Question

I have a CSV file that contains thousands of lines like this:

1001;basket/files/legobrick.mp3
4096;basket/files/sunshade.avi
2038;data/lists/blockbuster.ogg
2038;data/random/noidea.dat

I want to write this to a new CSV file but include only rows which contain '.mp3' or '.avi'. The output file should be just one column and look like this:

"basket/files/legobrick.mp3#1001",
"basket/files/sunshade.avi#4096",

So the first column should be suffixed to the second column and separated by a hash symbol and each line should be quoted and separated by a comma as shown above.

The source CSV file does not contain a header with column names. It's just data.

Can someone tell me how to code this in R?

Edit (following marked answer): This question is not a duplicate because it involves filtering rows and the output code format is completely different requiring different processing methods. The marked answer is also completely different which really backs up my assertion that this is not a duplicate.

It's not a duplicate. In my question here, only selected rows should be processed (depending on whether the row contains a specific value) and the output format is different requiring different processing methods. — Jstation, Jun 12 '19 at 10:44

Ronak Shah · Answer 1 · 2019-06-12T09:38:01.377

2

You can do it in the following way :

#Read the file with ; as separator
df <- read.csv2(text = text, header = FALSE, stringsAsFactors = FALSE)

#Filter the rows which end with "avi" or "mp3"
inds <- grepl("avi$|mp3$", df$V2)

#Create a new dataframe by pasting those rows with a separator
df1 <- data.frame(new_col = paste(df$V2[inds], df$V1[inds], sep = "#"))
df1

#                          new_col
#1 basket/files/legobrick.mp3#1001
#2  basket/files/sunshade.avi#4096

#Write the csv
write.csv(df1, "/path/of/file.csv", row.names = FALSE)

Or if you want it as a text file you can do

write.table(df1, "path/test.txt", row.names = FALSE, col.names = FALSE, eol = ",\n")

data

text = "1001;basket/files/legobrick.mp3
4096;basket/files/sunshade.avi
2038;data/lists/blockbuster.ogg
2038;data/random/noidea.dat"

edited Jun 12 '19 at 09:38

answered Jun 12 '19 at 08:57

Ronak Shah

377,200
20
156
213

This looks great except it's two columns, it surrounds each column in quotes and doesn't suffix each row with a comma. How do I make it look like this: "basket/files/legobrick.mp3#1001", (where there is only one column and each row is quoted and suffixed with a comma)? – Jstation Jun 12 '19 at 09:11
Almost, but the comma is inside the quote like this: "basket/files/sunshade.avi#4096," but I need it to be outside the quote like this: "basket/files/sunshade.avi#4096", The output file also contained a column header but I removed this by changing the last line to: write.table(df1, "outputfile.csv", row.names = FALSE, col.names=FALSE) – Jstation Jun 12 '19 at 09:31
1

@Jstation do you want it as text file instead? See the update with `write.table` – Ronak Shah Jun 12 '19 at 09:38
This is almost there now except each row needs to be suffixed with a comma after the end quote like this: "basket/files/sunshade.avi#4096", – Jstation Jun 12 '19 at 10:33
1

@Jstation If you check the `txt` file it has a comma suffix after the quotes (`"`). – Ronak Shah Jun 12 '19 at 10:34
I opened the output file and the comma isn't present. I'm opening the source file like this: df <- read.csv2("mycsvfile.csv", text = text, header = FALSE, stringsAsFactors = FALSE) and writing it like this: write.csv(df1, "mycsvfile.csv", row.names = FALSE) but the comma is not present. – Jstation Jun 12 '19 at 10:43
1

@Jstation please use `write.table` to write the file as shown in the answer. You'll not be able to see "," with `write.csv` – Ronak Shah Jun 12 '19 at 10:46
1

My apologies. I didn't see the write.table amendment. I changed write.csv to write.table as you instructed and the output file is now absolutely perfect. Even with thousands and thousands of source file lines, the processing is really fast. Many, many thanks, I'm very grateful to you, @Ronak Shah. – Jstation Jun 12 '19 at 10:50

score 1 · Answer 2 · answered Jun 12 '19 at 10:34

1

See whether the below code helps

library(tidyverse)
df %>% 
  filter(grepl("\\.mp3|\\.avi", file_path)) %>% 
  mutate(file_path = paste(file_path, ID, sep="#")) %>% 
  pull(file_path) %>% dput

answered Jun 12 '19 at 10:34

Theo

575
3
8

PavoDive · Accepted Answer · 2019-06-14T00:52:49.810

0

A data.table answer:

dt <- fread("file.csv")

fwrite(dt[V2 %like% "mp3$|avi$", .(paste0(V2, "#", V1))], "output.csv", col.names = FALSE)

edited Jun 14 '19 at 00:52

answered Jun 13 '19 at 02:12

PavoDive

6,322
2
29
55

This is very neat but it produces an error 'is.list(x) is not TRUE'. I believe this is because it is trying to read the CSV file as a data table. If I add 'data.table = FALSE' to 'fread' then it reads the CSV file ok but then fwrite can't access the columns. The source CSV is exactly as shown at the top of my question. Ronak Shah's solution works because it reads the source CSV as text then creates a new column for the output data. – Jstation Jun 13 '19 at 09:45
I'm sorry it didn't work. Let me work on it in a couple of hours, I'm away from my computer. – PavoDive Jun 13 '19 at 10:05
1

See the modified code. I added the `col.names = FALSE` argument to get the file as you need it. – PavoDive Jun 14 '19 at 00:53
1

Wow @PavoDive this is incredible, I love it. It's extremely compact and processes like lightning! Seriously, great job! – Jstation Jun 14 '19 at 11:15
1

That's the what the "f" stands for in `fread` and `fwrite`: **fast**. That's the overall data.table mantra ;) – PavoDive Jun 14 '19 at 11:26

How do I extract specific rows from a CSV and format the data in R?

3 Answers3