I am currently looking at the co-occurrence of various phenomena (gestures, intonation in speech) in time. As such, the data has each variable in its own column, and a phenomenon is shown as a repeated value across rows for as long as it co-occurs with the others, as follows:

Begin Time        End Time       g-phasing   apex   syllable    words   tones
00:00:02.000    00:00:04.266                         Zia        j'avais 
00:00:04.266    00:00:05.390    Preparation          Zia        j'avais 
00:00:05.390    00:00:05.519    Preparation           vE        j'avais 
00:00:05.519    00:00:05.852    Preparation           vE        j'avais     H*
00:00:05.852    00:00:05.910    Preparation           de        des         
00:00:05.910    00:00:05.970    Preparation           de        des 
00:00:05.970    00:00:06.236    Preparation           de        des 
00:00:06.236    00:00:06.276    Preparation           di        dizaines    
00:00:06.276    00:00:06.650    Preparation           di        dizaines    
00:00:06.650    00:00:06.795    Preparation          zEn        dizaines    
00:00:06.795    00:00:06.835    stroke               zEn        dizaines    
00:00:06.835    00:00:07.480    stroke               zEn        dizaines    
00:00:07.480    00:00:07.630    stroke        apex   zEn        dizaines    
00:00:07.630    00:00:07.857    stroke               zEn        dizaines    H*
00:00:07.857    00:00:08.080    stroke               zEn        dizaines    
00:00:08.080    00:00:08.120    stroke             ddeux        de  
00:00:08.120    00:00:08.226    Preparation        ddeux        de  
00:00:08.226    00:00:08.290    Preparation        ddeux        de  
00:00:08.290    00:00:08.900    Preparation           sy        sujets  
00:00:08.900    00:00:12.396    Preparation           sy        sujets  
00:00:12.396    00:00:12.410    stroke                sy        sujets  
00:00:12.410    00:00:12.628    stroke                ZE        sujets  
00:00:12.628    00:00:12.776    stroke        apex    ZE        sujets  
00:00:12.776    00:00:12.924    stroke                ZE        sujets  
00:00:12.924    00:00:12.990    stroke                ZE        sujets      H*
00:00:12.990    00:00:13.400    stroke                ZE        sujets  

This dataset shows that there are two strokes (one from 00:00:06.795 to 00:00:08.120, and a second one from 00:00:12.396 to 00:00:13.400).

Ideally, I would like to be able to count the number of strokes in the dataset, determine how many overlap with a pitch-accented syllable (here, the "H*" values in the "tones" column that correspond to the syllables "zEn" and "ZE"), how many do not co-occur with a pitch-accented syllable, etc.

I'm not sure if I should iterate over rows and create counters, if I should make use of the begin and end times, or if I should restructure the data. Any help would be greatly appreciated!

proh47

1 Answer

Probably helpful: Get statistics for each group (such as count, mean, etc) using pandas GroupBy?

Looking closely at the dataset, this is at heart a range (interval) aggregation problem. First filter for the rows that contain stroke data, using something like: How to select rows from a DataFrame based on column values? Then I recommend extracting the first two columns (begin and end time) into a separate dataframe or list and generating a list of merged intervals from them.
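
As a rough illustration of the interval-merging step, here is a minimal sketch in plain Python. The helper names to_ms and merge_intervals are made up for this example, and the (begin, end) pairs are the stroke rows copied from the question. It assumes time stamps are always in HH:MM:SS.mmm form and that consecutive rows of the same stroke share a boundary (the end of one row equals the start of the next).

```python
def to_ms(stamp):
    """Convert an 'HH:MM:SS.mmm' time stamp into integer milliseconds."""
    hours, minutes, rest = stamp.split(":")
    seconds, millis = rest.split(".")
    return ((int(hours) * 60 + int(minutes)) * 60 + int(seconds)) * 1000 + int(millis)


def merge_intervals(intervals):
    """Merge intervals that touch or overlap; returns sorted (start, end) tuples."""
    merged = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1]:
            # Touches or overlaps the previous interval: extend it.
            merged[-1][1] = max(merged[-1][1], end)
        else:
            # A gap before this interval: start a new one.
            merged.append([start, end])
    return [tuple(pair) for pair in merged]


# (begin, end) of every row whose g-phasing value is "stroke" in the question.
stroke_rows = [
    ("00:00:06.795", "00:00:06.835"),
    ("00:00:06.835", "00:00:07.480"),
    ("00:00:07.480", "00:00:07.630"),
    ("00:00:07.630", "00:00:07.857"),
    ("00:00:07.857", "00:00:08.080"),
    ("00:00:08.080", "00:00:08.120"),
    ("00:00:12.396", "00:00:12.410"),
    ("00:00:12.410", "00:00:12.628"),
    ("00:00:12.628", "00:00:12.776"),
    ("00:00:12.776", "00:00:12.924"),
    ("00:00:12.924", "00:00:12.990"),
    ("00:00:12.990", "00:00:13.400"),
]

merged = merge_intervals((to_ms(b), to_ms(e)) for b, e in stroke_rows)
print(merged)  # [(6795, 8120), (12396, 13400)]
```

The number of merged intervals is the number of strokes, which matches the two strokes described in the question.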

Initialize output_dataframe_list as an empty list. Then, for each (start, end) item in merged_intervals_list:

  • Extract the subset of rows that have column 1 >= start and column 2 <= end and contain the word stroke.
  • Use a groupby aggregation method on that subset to obtain the dataframe you need.
  • Append this dataframe, which contains the results for the interval, to output_dataframe_list.

Finally, concatenate all the dataframes together into a new dataframe using pd.concat.
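
The loop described above could look roughly like the following pandas sketch. It is only one possible reading of these steps: the column names (begin, end, g_phasing, tones) are my own, the rows are an excerpt of the question's data, and merged_intervals is hard-coded to the two stroke intervals stated in the question rather than computed.

```python
import pandas as pd

# Excerpt of the question's rows: (begin, end, g-phasing, tones).
rows = [
    ("00:00:06.650", "00:00:06.795", "Preparation", ""),
    ("00:00:06.795", "00:00:06.835", "stroke", ""),
    ("00:00:06.835", "00:00:07.480", "stroke", ""),
    ("00:00:07.480", "00:00:07.630", "stroke", ""),
    ("00:00:07.630", "00:00:07.857", "stroke", "H*"),
    ("00:00:07.857", "00:00:08.080", "stroke", ""),
    ("00:00:08.080", "00:00:08.120", "stroke", ""),
    ("00:00:08.120", "00:00:08.226", "Preparation", ""),
    ("00:00:08.900", "00:00:12.396", "Preparation", ""),
    ("00:00:12.396", "00:00:12.410", "stroke", ""),
    ("00:00:12.410", "00:00:12.628", "stroke", ""),
    ("00:00:12.628", "00:00:12.776", "stroke", ""),
    ("00:00:12.776", "00:00:12.924", "stroke", ""),
    ("00:00:12.924", "00:00:12.990", "stroke", "H*"),
    ("00:00:12.990", "00:00:13.400", "stroke", ""),
]
df = pd.DataFrame(rows, columns=["begin", "end", "g_phasing", "tones"])
df["begin"] = pd.to_timedelta(df["begin"])
df["end"] = pd.to_timedelta(df["end"])

# Assumption: the merged stroke intervals are already known (taken from the question).
merged_intervals = [
    (pd.Timedelta("00:00:06.795"), pd.Timedelta("00:00:08.120")),
    (pd.Timedelta("00:00:12.396"), pd.Timedelta("00:00:13.400")),
]

output_dataframe_list = []
for start, end in merged_intervals:
    # Rows that lie inside the interval and belong to a stroke.
    subset = df[(df["begin"] >= start) & (df["end"] <= end) & (df["g_phasing"] == "stroke")]
    # One summary row per interval: its span and whether any row carries "H*".
    summary = subset.groupby("g_phasing", as_index=False).agg(
        start=("begin", "min"),
        end=("end", "max"),
        has_pitch_accent=("tones", lambda s: (s == "H*").any()),
    )
    output_dataframe_list.append(summary)

result = pd.concat(output_dataframe_list, ignore_index=True)

n_strokes = len(result)
n_with_accent = int(result["has_pitch_accent"].sum())
n_without_accent = n_strokes - n_with_accent
print(result)
print(n_strokes, n_with_accent, n_without_accent)
```

Each row of result describes one stroke, so counting rows and summing the has_pitch_accent column gives the totals asked for in the question. Other tone labels or other gesture phases could be handled by changing the two comparison values.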

synaptikon