I am currently working with clinical assessment data that is scored and output by a software package in a .txt file. My goal is to extract the data from the .txt file into a long-format data frame with columns for participant number (which is included in the file name), subtest, Score, and T-Score.

An example data file is available here: https://github.com/AlexSwiderski/CatTextToData/blob/master/Example_data

I am running into a couple of roadblocks and could use some input on how to navigate them.

1) I only need the information that corresponds to each subtest; these rows all have a number before the subtest name. The rows that contain only one or two words (e.g. Cognitive Screen) are not necessary, and they seem to interfere with creating new data frames because the columns provided no longer match the columns wanted.

Some additional quirks of the data: 1) the asterisks are NOT necessary 2) the cognitive TOTAL will never have a value
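As a minimal sketch of the first quirk, assuming the T-scores have already been read in as character strings (the values below are made up for illustration and are not taken from the real file), the asterisks can be stripped before converting to integers:

```r
# Hypothetical T-score strings, some flagged with an asterisk
t_raw <- c("53", "56*", "45*")

# fixed = TRUE treats "*" as a literal character rather than a regex quantifier
t_score <- as.integer(gsub("*", "", t_raw, fixed = TRUE))

print(t_score)
```

Keeping the raw strings in a separate column until the conversion is checked makes it easier to spot values that were not plain numbers.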

I am using the readtext package to import the data at the moment, and I am able to get a data frame with two columns. One is the file name (which includes the participant name), so that problem is solved. However, the other column is a giant character string containing the data points for both Score and T-Score. Presumably I would then need to split this into the columns of interest listed above.

Next problem: when I view the data, the T-scores are in the correct order; however, the "score" data no longer matches the true values.

Here is what I have tried:

# install.packages("readtext")
library(readtext)
library(tidyr)
pathTofile <- path.expand("/Users/Brahma/Desktop/CAT TEXT FILES/")
data <- readtext(paste0(pathTofile, "CAToutput.txt"),
                  #docvarsfrom = "filenames",
                  dvsep = " ")

From here I do not know how to split the data. In my head, I would do something like this:

data2 <- separate(data, text, sep = " ", into = c("subtest", "score", "t_score"))

This, of course, gives the correct column names but removes almost all of the data I am actually interested in.
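One likely reason that splitting on a single space loses data is that the subtest names themselves contain spaces. A hedged sketch of an alternative, assuming the text has first been split into lines and that each subtest row looks like `number. name score T-score`, is to match whole rows with a regular expression in base R. The sample lines below are invented for illustration, and the tab separators are an assumption about the file layout:

```r
# Hypothetical lines standing in for the contents of one output file
lines <- c(
  "Cognitive Screen\tScore\tT-Score",
  "1. Line Bisection\t9\t53",
  "3. Word Fluency\t1\t56*",
  "Cognitive TOTAL"
)

# number, dot, subtest name, score, T-score with an optional trailing asterisk
pattern <- "^([0-9]+)\\.\\s+(.*?)\\s+([0-9]+)\\s+([0-9]+)\\*?$"

# regexec() returns the full match plus one entry per capture group;
# lines that do not match come back as character(0) and are dropped
m <- regmatches(lines, regexec(pattern, lines, perl = TRUE))
hits <- m[lengths(m) > 0]

parsed <- data.frame(
  subtest = vapply(hits, `[`, character(1), 3),
  score   = as.integer(vapply(hits, `[`, character(1), 4)),
  t_score = as.integer(vapply(hits, `[`, character(1), 5)),
  stringsAsFactors = FALSE
)

print(parsed)
```

Because only numbered rows match, section headers and the empty cognitive TOTAL row fall out without any extra filtering.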

Any help would be appreciated, whether a solution or a suggestion of where to look for more answers.

Sincerely,

Alex

Gregor Thomas
Aswiderski
  • It would be very nice if you included data in the question itself. Just a little bit of data. Please have a look at [how to make a reproducible example in R](https://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example) for tips on sharing some data. `dput()` is great for sharing a copy/pasteable version of data. – Gregor Thomas Sep 05 '19 at 20:56
  • Looking at your link, it seems like your data has multiple tables, not just one table. A question [like this](https://stackoverflow.com/q/27427229/903061) might help you out. – Gregor Thomas Sep 05 '19 at 20:58
  • That data is difficult to parse because it's comprised of a bunch of distinct tables, the names of which should really be values in another column. It's made more difficult by the fact that the test number + name does not appear to be a unique identifier (3. Word Fluency is repeated). I'm afraid the best solution might be an awk script that would turn the data into a single table before you import it with R. I'm assuming that you have a bunch of these data files, so wrangling them into a single table by hand is really not an option. – Gregory Sep 05 '19 at 21:01
  • This tutorial sounds very similar to your task: http://rpubs.com/dgrtwo/tidying-enron – Jon Spring Sep 05 '19 at 21:08
  • Thanks for the comments. @Gregor, adding data to the code was tricky given that the question in itself was "how can I wrangle this hot mess" ;) Nonetheless, comment received, and I will keep your input in mind for future posts. – Aswiderski Sep 06 '19 at 01:31
  • @JonSpring I will check out the link! – Aswiderski Sep 06 '19 at 01:32
  • @Gregor I'll check out the link! – Aswiderski Sep 06 '19 at 01:33

3 Answers

Here is a way of converting that text file to a data frame that you can do analysis on:

library(tidyverse)

input <- read_lines('c:/temp/scores.txt')

# do the match and keep only the second column
header <- as_tibble(str_match(input, "^(.*?)\\s+Score.*")[, 2, drop = FALSE])
colnames(header) <- 'title'

# add index to the list so we can match the scores that come after
header <- header %>%
  mutate(row = row_number()) %>%
  fill(title)  # copy title down

# pull off the scores on the numbered rows
scores <- str_match(input, "^([0-9]+[. ]+)(.*?)\\s+([0-9]+)\\s+([0-9*]+)$")
scores <- as_tibble(scores) %>%
  mutate(row = row_number())

# keep only rows that are numbered and delete first column
# (scores[[1]] yields a plain vector, which current tibble versions need for row subsetting)
scores <- scores[!is.na(scores[[1]]), -1]

# merge the header with the scores to give each section
table <- left_join(scores,
                   header,
                   by = 'row'
)
colnames(table) <- c('index', 'type', 'Score', 'T-Score', 'row', 'title')
head(table, 10)

# A tibble: 10 x 6
   index  type               Score `T-Score`   row title           
   <chr>  <chr>              <chr> <chr>     <int> <chr>           
 1 "1. "  Line Bisection     9     53            3 Subtest/Section 
 2 "2. "  Semantic Memory    8     51            4 Subtest/Section 
 3 "3. "  Word Fluency       1     56*           5 Subtest/Section 
 4 "4. "  Recognition Memory 40    59            6 Subtest/Section 
 5 "5. "  Gesture Object Use 2     68            7 Subtest/Section 
 6 "6. "  Arithmetic         5     49            8 Subtest/Section 
 7 "7. "  Spoken Words       17    45*          14 Spoken Language 
 8 "9. "  Spoken Sentences   25    53*          15 Spoken Language 
 9 "11. " Spoken Paragraphs  4     60           16 Spoken Language 
10 "8. "  Written Words      14    45*          20 Written Language
  • This is perhaps the most elegant output. I am going to have to go back and look into some of the lines of code you wrote. Very well done! – Aswiderski Sep 09 '19 at 17:01

What is the source for the code at the link provided?

https://github.com/AlexSwiderski/CatTextToData/blob/master/Example_data

This data is odd. I was able to successfully match patterns and manipulate most of the data, but two rows refused to oblige. Rows 17 and 20 refused to be matched. In addition, the data type / data structure are very unfamiliar.

This is what was accomplished before hitting a wall.

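library(dplyr)
library(stringr)

df <- read.csv("test.txt", header = FALSE, sep = ".", skip = 1)

# digits, one whitespace character, digits, then one or more asterisks
df1 <- df %>% mutate(Extract = str_extract(V2, "[1-9]+\\s[1-9]+\\*+\\s?"))
# two numbers separated by any single character at the end of the row
df2 <- df1 %>% mutate(Extract2 = str_extract(V2, "[0-9]+.[0-9]+$"))

head(df2)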

When the data was further explored, the second column, V2, included data types that are completely unfamiliar. These included: Arithmetic, Complex Words, Digit Strings, and Function Words.

If anything, it would be good to know something about those unfamiliar data types.

Gray
  • This was an avenue I went down to try to separate the data as well, but I did not get as far as you. Well done! The data is from a clinical assessment called the Comprehensive Aphasia Test. There are 27 subtests in total. The source for the code is an automated scoring application where clinicians enter the scores from the assessment and then automatically get the T-Scores. – Aswiderski Sep 06 '19 at 15:48

Took another look at this problem and found where it had gotten off track. Ignore my previous post. This solution works in Jupyter Lab using the data that was provided.

library(stringr)
library(dplyr)

df <- read.csv("test.txt", header = FALSE, sep = ".", skip = 1)

# first run of digits in V2
df1 <- df %>% mutate("Score" = str_extract(V2, "\\d+"))

# two trailing digits, optionally followed by an asterisk
df2 <- df1 %>% mutate("T Score" = str_extract(V2, "\\d\\d\\*?$"))

# remove each run of tabs together with the digits that follow it
df3 <- df2 %>% mutate("Subtest/Section" = str_remove_all(V2, "\\t+[0-9]+"))

# whitespace followed by two digits in V1
df4 <- df3 %>% mutate("Sub-S" = str_extract(V1, "\\s\\d\\d\\s*"))

# two digits followed by an asterisk in V1
df5 <- df4 %>% mutate("Sub-T" = str_extract(V1, "\\d\\d\\*"))

df6 <- replace(df5, is.na(df5), "")

df7 <- df6 %>% mutate("Description" = str_remove_all(V1, "\\d\\d\\s\\d\\d\\**$"))   # remove digits, new variable

df7$V1 <- NULL         # remove variable
df7$V2 <- NULL         # remove variable

df8 <- df7[, c(6,3,1,4,2,5)]       # re-align variables
head(df8,15)

Gray
  • Hi, following up to learn whether this code worked out for you and whether or not the code is actually the best solution for your purposes. If so, I would much appreciate your designating the code as the best solution. Regards, – Gray Sep 11 '19 at 16:49