
'''
                A stray SKATEBOARD clips her, causing her to stumble and 
      spill her coffee, as well as the contents of her backpack.

      The young RIDER dashes over to help, trembling when he sees 
      who his board has hit.

                             RIDER
                Hey -- sorry.

      Cowering in fear, he attempts to scoop up her scattered 
      belongings.

                             KAT
                Leave it 

      He persists.

                             KAT (continuing)
                I said, leave it!

                             RIDER
                Hey -- sorry.
'''

I'm scraping some scripts for text analysis and want to extract only the dialogue, which appears to be indented by a fixed amount of whitespace. For example, I want the line "Hey -- sorry.". The indent is 20 spaces and is consistent throughout the script. How can I read in only that line and the others with the same indentation?

I'm thinking of using read.fwf, which reads fixed-width files.

What do you guys think?

I'm scraping from urls like this: https://imsdb.com/scripts/10-Things-I-Hate-About-You.html

bob0901

2 Answers

library(tidytext)
library(tidyverse)

text <- c("PADUA HIGH SCHOOL - DAY
          
          Welcome to Padua High School,, your typical urban-suburban 
          high school in Portland, Oregon.  Smarties, Skids, Preppies, 
          Granolas. Loners, Lovers, the In and the Out Crowd rub sleep 
          out of their eyes and head for the main building.
          
          PADUA HIGH PARKING LOT - DAY
          
          KAT STRATFORD, eighteen, pretty -- but trying hard not to be 
          -- in a baggy granny dress and glasses, balances a cup of 
          coffee and a backpack as she climbs out of her battered, 
          baby blue '75 Dodge Dart.
          
          A stray SKATEBOARD clips her, causing her to stumble and 
          spill her coffee, as well as the contents of her backpack.
          
          The young RIDER dashes over to help, trembling when he sees 
          who his board has hit.
          
                                 RIDER
                    Hey -- sorry.
          
          Cowering in fear, he attempts to scoop up her scattered 
          belongings.
          
                                 KAT
                    Leave it 
          
          He persists.
          
                                 KAT (continuing)
                    I said, leave it!
          
          She grabs his skateboard and uses it to SHOVE him against a 
          car, skateboard tip to his throat.  He whimpers pitifully 
          and she lets him go.  A path clears for her as she marches 
          through a pack of fearful students and SLAMS open the door, 
          entering school.
          
          INT. GIRLS' ROOM - DAY
          
          BIANCA STRATFORD, a beautiful sophomore, stands facing the 
          mirror, applying lipstick.  Her less extraordinary, but 
          still cute friend, CHASTITY stands next to her.  
          
                                 BIANCA
                    Did you change your hair?
          
                                 CHASTITY 
                    No.
          
                                 BIANCA
                    You might wanna think about it
          
          Leave the girls' room and enter the hallway.
          
          HALLWAY - DAY- CONTINUOUS
          
          Bianca is immediately greeted by an admiring crowd, both 
          boys
          and girls alike.
          
                                 BOY
                           (adoring)
                    Hey, Bianca.
          
                                 GIRL
                    Awesome shoes.
          
          The greetings continue as Chastity remains wordless and 
          unaddressed by her side.  Bianca smiles proudly, 
          acknowledging her fans.
          
          GUIDANCE COUNSELOR'S OFFICE - DAY
          
          CAMERON JAMES, a clean-cut, easy-going senior with an open, 
          farm-boy face, sits facing Miss Perky, an impossibly cheery 
          guidance counselor.")
          
          

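# Speaker cues to drop; matched at the start of each lower-cased, trimmed line
names_stopwords <- c("^(rider|kat|chastity|bianca|boy|girl)")

text %>% 
  as_tibble() %>% 
  # Split into one row per line (unnest_tokens also lower-cases the text)
  unnest_tokens(text, value, token = "lines") %>% 
  # Keep lines containing a run of 15 or more whitespace characters
  filter(str_detect(text, "\\s{15,}")) %>% 
  mutate(text = str_trim(text)) %>% 
  # Drop speaker cues such as "rider" or "kat (continuing)"
  filter(!str_detect(text, names_stopwords)) 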

Output:

# A tibble: 9 x 1
  text                          
  <chr>                         
1 hey -- sorry.                 
2 leave it                      
3 i said, leave it!             
4 did you change your hair?     
5 no.                           
6 you might wanna think about it
7 (adoring)                     
8 hey, bianca.                  
9 awesome shoes. 

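You can add further character names to the alternation in the names_stopwords pattern. Note that the pattern is anchored to the start of the line, so a line of dialogue that begins with one of the listed names would be dropped as well.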
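If the dialogue indent is known exactly, another option is to measure each line's leading whitespace and keep only exact matches, which avoids maintaining a list of character names. The following is a minimal base R sketch, not part of the answer above. It assumes dialogue is indented by exactly 20 spaces, as in the sample text; `script_lines` and `dialogue_indent` are illustrative names.

```r
# Hypothetical sample lines; the indent widths mirror the sample text above
script_lines <- c(
  paste0(strrep(" ", 10), "The young RIDER dashes over to help."),
  paste0(strrep(" ", 33), "RIDER"),
  paste0(strrep(" ", 20), "Hey -- sorry."),
  strrep(" ", 10),
  paste0(strrep(" ", 33), "KAT"),
  paste0(strrep(" ", 20), "Leave it")
)

# Assumption: the dialogue indent differs between scripts, so set it per script
dialogue_indent <- 20

# Length of the leading run of spaces on each line (0 if there is none)
indent <- attr(regexpr("^ *", script_lines), "match.length")

# Keep lines indented by exactly `dialogue_indent` spaces that contain text
is_dialogue <- indent == dialogue_indent & nzchar(trimws(script_lines))
dialogue <- trimws(script_lines[is_dialogue])
dialogue
```

Running `table(indent)` on a full script is a quick way to see which indent widths occur and to pick the value for `dialogue_indent`.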

Desmond
  • Thanks for the reply. This looks like exactly what I need. How do I only keep "hey-- sorry." Something like "\\s{20...? – bob0901 Apr 15 '21 at 03:13
  • In https://imsdb.com/scripts/10-Things-I-Hate-About-You.html that you linked, could you share more examples of text portions you wish to keep, so I can try writing a more robust regex? – Desmond Apr 15 '21 at 03:18
  • I'm now realizing that the amount of spacing will change depending on the script, but it will stay consistent throughout any one script. I'd like to understand this portion: "\\s{15,}". I will update my post right now with more lines from that script – bob0901 Apr 15 '21 at 03:27
  • I do not want character names. Like Kat and Rider. Just their dialogue. – bob0901 Apr 15 '21 at 03:32
  • Take a look at this cheat sheet: https://evoldyn.gitlab.io/evomics-2018/ref-sheets/R_strings.pdf "\\s{15,}" filters for lines with 15 or more whitespaces. You could add the character names as a list of stopwords to filter them out too. – Desmond Apr 15 '21 at 03:35
  • If I change "\\s{15,}" to "\\s{20,}" that is for 20 whitespaces? I like this cheatsheet. I'm going to look into it tomorrow morning – bob0901 Apr 15 '21 at 03:40
  • I updated my answer by increasing the text input and amending the text filter to be more robust. Please mark it with a tick if this answers your question. "\\s{20,}" means *20 or more* white spaces. Take a look at the Quantifiers section in the cheat sheet for examples. – Desmond Apr 15 '21 at 04:16
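The quantifier forms discussed in the comments above can be compared directly. This is a small base R sketch on made-up lines (`sample_lines` is an illustrative name) showing that `{15,}` and `{20,}` are lower bounds, while `{20}` followed by a non-whitespace character pins the indent to exactly 20.

```r
# Hypothetical lines with 10, 20 and 33 leading spaces
sample_lines <- c(
  paste0(strrep(" ", 10), "action text"),
  paste0(strrep(" ", 20), "dialogue text"),
  paste0(strrep(" ", 33), "NAME")
)

# 15 or more leading whitespace characters
at_least_15 <- grepl("^\\s{15,}", sample_lines, perl = TRUE)

# 20 or more leading whitespace characters
at_least_20 <- grepl("^\\s{20,}", sample_lines, perl = TRUE)

# Exactly 20 leading whitespace characters, then a non-whitespace character
exactly_20 <- grepl("^\\s{20}\\S", sample_lines, perl = TRUE)

at_least_15  # FALSE  TRUE  TRUE
at_least_20  # FALSE  TRUE  TRUE
exactly_20   # FALSE  TRUE FALSE
```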

You can try the following:

library(magrittr)  # provides the %>% pipe

url <- 'https://imsdb.com/scripts/10-Things-I-Hate-About-You.html'

url %>%
  #Read webpage line by line
  readLines() %>%
  #Remove '<b>' and '</b>' from string
  gsub('<b>|</b>', '', .) %>%
  #Select only the lines which begin with 20 or more whitespace characters
  grep('^\\s{20,}', ., value = TRUE) %>%
  #Remove leading and trailing whitespace
  trimws() %>%
  #Remove strings made up only of capital letters (speaker names)
  grep('^([A-Z]+\\s?)+$', ., value = TRUE, invert = TRUE)

#[1] "Hey -- sorry."             "Leave it"                  "KAT (continuing)"
#[4] "I said, leave it!"         "Did you change your hair?" "No."
#...
#...

I have tried to clean this up as much as possible, but it might require further cleaning depending on what you actually want to extract.
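As one possible further cleaning step, speaker cues that carry a parenthetical (such as "KAT (continuing)" in the output above) and standalone parentheticals (such as "(adoring)") could be dropped with two more patterns. This is a hedged sketch on a hand-written vector, not on the scraped page; `extracted`, `is_cue` and `is_parenthetical` are illustrative names.

```r
# Hypothetical vector resembling the extracted lines shown above
extracted <- c(
  "Hey -- sorry.", "Leave it", "KAT (continuing)",
  "I said, leave it!", "(adoring)", "Hey, Bianca."
)

# Speaker cue: capitals, spaces, dots, apostrophes or hyphens only,
# optionally followed by a parenthetical such as "(continuing)"
is_cue <- grepl("^[A-Z][A-Z .'-]*(\\s*\\(.*\\))?$", extracted, perl = TRUE)

# Standalone parenthetical such as "(adoring)"
is_parenthetical <- grepl("^\\(.*\\)$", extracted, perl = TRUE)

cleaned <- extracted[!(is_cue | is_parenthetical)]
cleaned
```

One limitation of this sketch is that dialogue written entirely in capitals (for example "NO.") would be treated as a speaker cue and removed.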

Ronak Shah