4

I have a large dataframe (approximately 40000 x 500) that I read from a .CSV file. The dataset contains a list of error code combinations, with each combination specifically listing all the unique codes. I'm searching for the rows that contain a certain piece of string, like "name=", to see if it is known who listed that combination. The problem is that I do not know in which column this may occur, since the 10th column and onwards are irregular per row, as the number of error codes per combination is irregular. This means that after some point the rows are empty, except for the biggest error code combination. I only know that if the row contains the piece of string I'm looking for, it will be in one of the last 6 columns of a row containing any information. If the piece of string I'm looking for is found in a row, I would like an answer in a different column.

I have extensively looked for a solution, but I couldn't find anything that serves my needs. Also, I'm new to R.

[Fake dataset]

  Year  Problem_ID  Number.of.errors  X      X.1    X.2       X.3       X.4
1 2005  Server      2                 X-4555 X-1222 name=Sara
2 2011  Hardware    3                 X-8922 X-7644 X-6485
3 2015  Software    4                 X-8494 X-4321 X-7452    X-5321   name=John
4 2018  Hardware    2                 X-6901 X-1121  
5 2007  Server      3                 X-7655 X-6555 X-8999    name=Sara

[Desired output]

  Year  Problem_ID  Number.of.errors Name X      X.1    X.2       X.3       X.4
1 2005  Server      2                1    X-4555 X-1222 name=Sara
2 2011  Hardware    3                0     X-8922 X-7644 X-6485
3 2015  Software    4                1     X-8494 X-4321 X-7452    X-5321   name=John
4 2018  Hardware    2                0     X-6901 X-1121  
5 2007  Server      3                1     X-7655 X-6555 X-8999    name=Sara
smci
  • 32,567
  • 20
  • 113
  • 146
Joey
  • 67
  • 1
  • 6
  • 2
    This sounds very doable (using `apply`, for example). Can you please make a small fake dataset (a few rows, columns) and what the expected result should look like. Would you like the column/row value, index where the phrase occurs...? Once you do, please ping me in the comment section and I might be able to help (unless akrun beats me to it :) ). – Roman Luštrik May 07 '18 at 09:08
  • 3
    Welcome to StackOverflow! Please read the info about [how to ask a good question](http://stackoverflow.com/help/how-to-ask) and how to give a [reproducible example](http://stackoverflow.com/questions/5963269). This will make it much easier for others to help you. – Sotos May 07 '18 at 09:08
  • Welcome to stack! If you want some help here, you need to provide an example on which people can make test, and use to provide an answer. See https://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example – denis May 07 '18 at 09:09
  • @RomanLuštrik Thanks for your willingness to help! I have a added a small fake dataset as requested. – Joey May 07 '18 at 09:27
  • 1
    *"I'm searching for the rows that contain a certain piece of string, like "name=", to see if it is known who listed that combination."* Do you only want to create a new integer column `name=0/1` according as whether the substring `"name=..."` occurred in the rightmost fields? That's all? And we just throw away the "Sara/John/"etc. value? And we don't need to record what column it was found in? Please state the task clearly. – smci May 07 '18 at 09:32
  • Additionally to the question from @smci: Could the string "name=" appear more than once per row? – Tobias May 07 '18 at 09:45
  • @smci your assumptions there are correct. I'm not interested in anything else :) – Joey May 07 '18 at 10:45
  • @Tobias No, "name=" can only appear once per row – Joey May 07 '18 at 10:46
  • I'm fighting back the urge to rewrite the question down to the one-liner *"I want to create an integer column `Name=0/1` according as whether the string "name=" occurs in any of the columns from 6 onwards"*. All the rest is unnecessary. Even the criterion that "name=" can only appear once per row is irrelevant, it doesn't affect the answer. Btw, we could pretty much for free extract the actual `Name=John/Sara/...` and store it in the Name column, or '' for no match. Then you can just test `as.numeric(df$Name != '')` to get your 0/1. But whatevs... – smci May 07 '18 at 23:15

2 Answers2

4

Option #1:

You can use apply as suggested by @RomanLuštrik to get name column. One has to check if any column in that row has matching text to name= using any and grep. The solution will be like:

df$name <- apply(df, 1, function(x)as.integer(any(grep("name=",x))))

df
#   Year Problem_ID Number.of.errors      X    X.1       X.2       X.3       X.4 name
# 1 2005     Server                2 X-4555 X-1222 name=Sara      <NA>      <NA>    1
# 2 2011   Hardware                3 X-8922 X-7644    X-6485      <NA>      <NA>    0
# 3 2015   Software                4 X-8494 X-4321    X-7452    X-5321 name=John    1
# 4 2018   Hardware                2 X-6901 X-1121      <NA>      <NA>      <NA>    0
# 5 2007     Server                3 X-7655 X-6555    X-8999 name=Sara      <NA>    1

Option#2: In case OP is interested in only rows which contains name= text then it would be user csv line-wise and then filter out rows containing name=.

  All_lines <- readLines(file_name)

  desired_lines <- grep("name=", All_lines, value = TRUE)

The desired_lines got only those rows which contains name=.

Data:

df <- read.table(text = 
"Year  Problem_ID  Number.of.errors  X      X.1    X.2       X.3       X.4
1 2005  Server      2                 X-4555 X-1222 name=Sara NA       NA
2 2011  Hardware    3                 X-8922 X-7644 X-6485    NA       NA
3 2015  Software    4                 X-8494 X-4321 X-7452    X-5321   name=John
4 2018  Hardware    2                 X-6901 X-1121 NA       NA       NA 
5 2007  Server      3                 X-7655 X-6555 X-8999    name=Sara    NA",
header = TRUE, stringsAsFactors = FALSE)
MKR
  • 19,739
  • 4
  • 23
  • 33
1

As things change, you can use dplyr filter and if_any

your_data %>%
 filter(if_any(everything(), ~str_detect(tolower(.), "name=")))
FabianUx
  • 32
  • 5