
I have a list of text sections that need to be split into sentences:

>  textList <- list(sections=sections[(length(sections)-2):length(sections)])
>  textList$sentences <- sapply(textList$sections, function(x) strsplit(as.character(x), "(?<=und/KON)\\s(?!\\S+/V)|(?<=oder/KON)\\s|(?<=/\\$[[:punct:]])\\s(?!dass/KOUS)(?!dann/ADV)(?!weil/KOUS)", perl=TRUE))
>  sent <- textList$sentences

The final goal is to add IDs to all sentences and arrange them in a list of data frames, one data frame per section.

>  sent.list <- lapply(seq_along(sent), function(i)
+                               data.frame(ID=paste(sprintf("%02d", i), sprintf("%03d", seq_along(sent[[i]])), sep = ""),
+                                          Sentence=sent[[i]]))
Error in data.frame(ID = paste(sprintf("%02d", i), sprintf("%03d", seq_along(sent[[i]])),  : 
  arguments imply differing number of rows: 1, 0

ISSUE: However I vary the split in the first step, I end up with a list containing exactly one character(0) element (the last one). This makes the second step, creating the list of data frames, fail with the error above.

Please note that the structure of the list seems somehow corrupted. In the R console output below, the first two sections begin (at #*) with $..., which I cannot interpret meaningfully, whereas the third section (at #**) starts with [[3]].

>  sent
$... #*
 [1] "Das/ART Spiel/NN besteht/VVFIN aus/APPR mehreren/PIAT Früchten/NN -LRB-/TRUNC rote/ADJA Kirschen/NN ,/$," 
 .
 .
 . 
[51] "-RRB-/TRUNC sie/PPER bleiben/VVFIN die/ART ganze/ADJA Zeit/NN über/APPR konzetriert/ADJD bei/APPR der/ART Sache/NN ./$."                                                                       
[52] "Das/ART Spiel/NN ist/VAFIN eine/ART absolue/ADJA Kaufempfehlung/NN !!!!/CARD "                                                                                                                 

$... #*
 [1] "Obstgarten/NN ist/VAFIN DAS/NE Einsteigerspiel/NN für/APPR Kinder/NN ab/APPR zwei/CARD Jahren/NN ./$."  
 .
 .
 .                    
[36] "hochgelobten/ADJA Klassiker/NN werden/VAFIN lassen/VVINF kann/VMFIN ./$."                                                                                                                                

[[3]] #**
character(0)

I tried hard to reproduce the error on artificial data, without much success, so please excuse the complicated code.

The smallest version of textList for which I could reproduce the error when executed in the R console:

> textList
$sections
[1] "Obstgarten/NN ist/VAFIN DAS/NE Einsteigerspiel/NN für/APPR Kinder/NN ab/APPR zwei/CARD Jahren/NN ./$. Preis/NN führt/VVFIN ,/$, aus/APPR einem/ART einfachen/ADJA Spiel/NN schnell/ADJD einen/ART hochwertigen/ADJA und/KON hochgelobten/ADJA Klassiker/NN werden/VAFIN lassen/VVINF kann/VMFIN ./$. "
[2] ""    

Below is the dput output of the smallest version of textList that reproduces the error.

structure(list(sections = c("Obstgarten/NN ist/VAFIN DAS/NE Einsteigerspiel/NN für/APPR Kinder/NN ab/APPR zwei/CARD Jahren/NN ./$. Die/ART Spielidee/NN ist/VAFIN wie/KOKOM bei/APPR allen/PIDAT Spielen/NN mit/APPR dieser/PDAT Zielaltersklasse/NN außerordentlich/ADJD einfach/ADJD ./$. Hier/ADV geht/VVFIN es/PPER darum/PROAV ,/$, reihum/ADV zu/PTKZU würfeln/VVINF ./$. Der/ART Würfel/NN zeigt/VVFIN keine/PIAT Zahlen/NN ,/$, sondern/KON vier/CARD Farben/NN ,/$, einen/ART Raben/NN und/KON einen/ART Obstkorb/NN ./$. Bei/APPR einer/ART Farbe/NN darf/VMFIN man/PIS ein/ART Stück/NN Obst/NN von/APPR einem/ART der/ART vier/CARD Obstbäume/NN im/APPRART Obstgarten/NN pflücken/VVFIN ,/$, bei/APPR einem/ART Raben/NN muss/APPR eines/ART von/APPR neun/CARD Rabenpuzzleteilen/NN gelegt/VVPP werden/VAINF ,/$, bei/APPR einem/ART Obstkorb/NN darf/VMFIN man/PIS zwei/CARD Obststücke/NN nach/APPR Wahl/NN abräumen/VVINF ./$. Entweder/KON es/PPER gewinnen/VVFIN alle/PIS ,/$, weil/KOUS alles/PIS Obst/NN abgeerntet/VVPP ist/VAFIN ,/$, bevor/KOUS der/ART Rabe/NN fertig/ADJD gepuzzlet/VVPP wurde/VAFIN oder/KON es/PPER verlieren/VVFIN alle/PIDAT gegen/APPR den/ART fertigen/ADJA Raben/NN ./$. Die/ART Idee/NN eines/ART ``/CARD kooperativen/ADJA ''/ADJA Spiels/NN hat/VAFIN viele/PIDAT Freunde/NN ,/$, macht/VVFIN das/ART Spiel/NN aber/ADV noch/ADV langweiliger/ADJD ,/$, als/KOUS es/PPER unbedingt/ADV nötig/ADJD wäre/VAFIN ./$. Unser/PPOSAT vierjähriger/ADJA Sohn/NN versucht/VVFIN schon/ADV so/ADV zu/PTKZU mogeln/VVINF ,/$, dass/KOUS der/ART Rabe/NN gewinnt/VVFIN -/$( einfach/ADV um/APPR mehr/PIAT Pepp/NN in/APPR das/ART Spiel/NN zu/PTKZU bringen/VVINF ./$. 
Selbst/ADV unsere/PPOSAT zweijährige/ADJA Tochter/NN wagt/VVFIN sich/PRF schon/ADV an/APPR die/ART Regeln/NN ,/$, wenn/KOUS sie/PPER sich/PRF spielerisch/ADJD dem/ART Diktat/NN des/ART Würfels/NN verweigert/VVFIN und/KON erklärt/VVFIN ,/$, jedes/PIDAT Obst/NN zu/PTKZU pflücken/VVINF ,/$, aber/KON bei/APPR einem/ART roten/ADJA Würfel/NN keine/PIAT rote/ADJA Kirsche/NN ./$. Das/ART Spiel/NN besticht/VVFIN vor/APPR allem/PIS durch/APPR die/ART Qualität/NN seiner/PPOSAT Verarbeitung/NN ./$. Die/ART Obstsorten/NN sind/VAFIN gut/ADJD gestaltete/ADJA und/KON lackierte/ADJA Holzstücke/NN ./$. Die/ART Kirschen/NN hängen/VVFIN paarweise/ADV am/APPRART Baum/NN und/KON auch/ADV die/ART Obstkörbe/NN sind/VAFIN liebevoll/ADJD geflochten/VVPP ./$. Solch/PIDAT ein/ART Spiel/NN packt/VVFIN man/PIS immer/ADV wieder/ADV gerne/ADV aus/PTKVZ ./$. Besonders/ADV schön/ADJD ist/VAFIN die/ART Sonderedition/NN im/APPRART Blechkasten/NN statt/APPR im/APPRART Pappkarton/NN ./$. Warum/PWAV Spielehersteller/NN sich/PRF immer/ADV wieder/ADV vor/APPR den/ART Kosten/NN einer/ART hochwertigen/ADJA Herstellung/NN drücken/VVINF bleibt/VVFIN ein/ART ungeklärtes/ADJA Geheimnis/NN ,/$, zumal/KOUS so/ADV schöne/ADJA Spiele/NN wie/KOKOM Obstgarten/NN beweisen/VVFIN ,/$, dass/KOUS eine/ART hochwertige/ADJA und/KON liebevolle/ADJA Gestaltung/NN ,/$, die/PRELS selbstverständlich/ADJD zu/APPR einem/ART etwas/ADV höheren/ADJA Preis/NN führt/VVFIN ,/$, aus/APPR einem/ART einfachen/ADJA Spiel/NN schnell/ADJD einen/ART hochwertigen/ADJA und/KON hochgelobten/ADJA Klassiker/NN werden/VAFIN lassen/VVINF kann/VMFIN ./$. ", 
"")), .Names = "sections")
smci
alex
  • 4
    please `dput` a small subset of your data that still causes the problem. – BrodieG Jan 02 '14 at 20:27
  • 4
    A [reproducible example](http://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example), which we can check and debug on our machines, helps us help you. – Blue Magister Jan 02 '14 at 20:33
  • @BrodieG @BlueMagister Thank you very much. I try but I don't know exactly how. Can I somehow post the `.Rdata` file? As mentioned, I cannot reproduce the list. I add the R output of the shortest object version which produces the error now. – alex Jan 02 '14 at 20:57
  • how big is the smallest version of `sections` which still produces the error? – BrodieG Jan 02 '14 at 20:59
  • @BrodieG `1,37 KB (1.412 Bytes)`,according to Windows (the .Rdata file) – alex Jan 02 '14 at 21:03
  • You've been told in two different ways already how to add reproducible data to your question. Until you actually follow those instructions, no one can help you. – joran Jan 02 '14 at 21:08
  • In case you did not realize that I had embedded a link: http://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example – Blue Magister Jan 02 '14 at 21:21
  • @joran accomplished :) Sorry I'm a bit new in this field. Thank you! – alex Jan 02 '14 at 21:22
  • @al remove the part from "Plase note that" to "EDIT3start". to clean your question – agstudy Jan 02 '14 at 21:24
  • 1
    The example data you posted is a list, containing a character vector of length 2. The second element of that vector is the character string `""`, i.e. it is empty. What do you expect to happen when splitting that element? – joran Jan 02 '14 at 21:25
  • @joran I would like to avoid creating a list that the following `sapply` cannot process, even if splitting produces some empty vectors. I tried to eliminate those elements, but without much success. Thank you! :) – alex Jan 02 '14 at 21:31
  • 1
    So only run it on elements with `length() > 0`. This is just a matter of subsetting. – joran Jan 02 '14 at 21:35
  • @joran: it is indeed a language wart that `strsplit("","")` returns `list(character(0))` with length 1, rather than `list()` with length 0. And `stringr::str_split("","")` returns `list("")` with length 1. It's all a little mad. – smci May 17 '14 at 21:21

1 Answer

4

Just remove the elements whose length is 0:

sent <- unlist(sent,recursive=FALSE)
sent <- sent[lapply(sent,length)>0]
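
As a minimal sketch of the same idea, assuming `sent` is a list with one character vector per section (as the console output in the question suggests), the empty elements can also be dropped without flattening the list first. The toy data and the names `keep` and `sent_nonempty` are made up for illustration and are not part of the original post:

```r
# Toy stand-in for `sent`: one character vector per section, the last one empty.
sent <- list(c("first sentence", "second sentence"),
             "only sentence",
             character(0))

# Keep only the sections that produced at least one sentence.
# vapply() returns a plain integer vector, so the comparison is unambiguous.
keep <- vapply(sent, length, integer(1)) > 0
sent_nonempty <- sent[keep]

# Build one data frame per remaining section.
# The ID is the two-digit section index followed by the three-digit sentence index.
sent.list <- lapply(seq_along(sent_nonempty), function(i)
  data.frame(ID = paste0(sprintf("%02d", i),
                         sprintf("%03d", seq_along(sent_nonempty[[i]]))),
             Sentence = sent_nonempty[[i]]))
```

Note that in this sketch the section index in the ID is counted after filtering, so if an empty section sat in the middle of the list, the later sections would be renumbered. If the original positions matter, one option would be to iterate over `which(keep)` instead.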

EDIT: The OP seems to have trouble reproducing the error, so I show here how to reproduce it:

Using this as sent for example:

sent = list("a",character(0))  ## you get an error because of character(0)

 lapply(seq_along(sent), 
           function(i)
             data.frame(ID=paste(sprintf("%02d", i),
                                  sprintf("%03d", seq_along(sent[[i]])), sep = ""),
                        Sentence=sent[[i]]))

This reproduces the error:

Error in data.frame(ID = paste(sprintf("%02d", i), sprintf("%03d", seq_along(sent[[i]])),  : 
  arguments imply differing number of rows: 1, 0
agstudy
  • A1! Thank you very much. I struggled 2 days on this. Do you know why the first element in the list is printed `$... ` and the empty one like `[[n]]`? It is somehow confusing. – alex Jan 03 '14 at 11:38
  • 1
    I guess it is because `sapply` tries to name your list. Maybe you should use `lapply` to drop all names and just get an indexed list. – agstudy Jan 03 '14 at 11:48
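
The naming behaviour guessed at in the last comment can be checked with a small sketch. It assumes a character vector of sections whose last element is empty, as in the question; the toy strings and the names `s_named` and `s_plain` are illustrative only. With a character input and the default `USE.NAMES = TRUE`, `sapply` uses the input strings themselves as names, which would explain why non-empty sections print under a `$` header while the empty one, whose name is the empty string, prints as `[[n]]`:

```r
# Toy sections: one tagged string and one empty string.
sections <- c("a/NN b/NN", "")

# sapply() on a character vector names the result after the input strings.
s_named <- sapply(sections, function(x) strsplit(x, " "))
names(s_named)

# lapply() adds no names, so every element prints with an index like [[1]].
s_plain <- lapply(sections, function(x) strsplit(x, " ")[[1]])
names(s_plain)

# In both cases the empty section yields a zero-length character vector.
length(s_named[[2]])
length(s_plain[[2]])
```

Passing `USE.NAMES = FALSE` to `sapply` should give the same unnamed result as the `lapply` variant.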