Longest line in text dataset

Question

I am looking for a way to find the length of the longest line in a text file.

E.g. consider a simple dataset from the tm package.

install.packages("tm")
library(tm)
txt <- system.file("texts", "txt", package = "tm") 

ovid <- VCorpus(DirSource(txt, encoding = "UTF-8"), readerControl = 
list(language = "lat"))

length(ovid)
[1] 5

ovid comprises five documents, each a character vector of n elements (from 16 to 18), among which I would like to identify the longest. I found documentation for Python, C# and the bash shell but, surprisingly, nothing for R. Because of that, my attempts were quite naive:

max(nchar(ovid))
[1] 5410
max(length(ovid))
[1] 5

Thank you for your suggestion, but I am afraid it did not return the correct answer. — Worice, Apr 20 '16 at 08:53
What would the correct answer be? `ovid[which.max(nchar(ovid))]`? — Sotos, Apr 20 '16 at 09:15
The fifth [5] should be the longest. Since the dataset is small, I have been able to visually check the length of the lines. Thank you @Sotos for suggesting this clarification. — Worice, Apr 20 '16 at 09:17
Unfortunately, it doesn't. Should I edit the question with the results of @RichardTelford? The first result is "1", the second reports the VCorpus information. Did you get the same result I got? Perhaps I am applying it wrongly. — Worice, Apr 20 '16 at 09:23
You want the maximum `length` from the five elements? I.e. `max` of `lengths(lapply(ovid, as.character))`? — alexis_laz, Apr 20 '16 at 10:01
Yes, @alexis_laz. I want to identify the longest of the five elements. But you have just given me the right input to solve the problem. The call `which.max(lengths(lapply(ovid, as.character)))` returns, correctly, the fifth line as the longest. Thank you very much! — Worice, Apr 20 '16 at 10:04
Actually what that command returns is the index of the longest character vector in the list that is `ovid`. The object `ovid` is a list of character vectors, and the fifth element of the list has 18 elements, three of which are the empty string. Look at `lapply(ovid, as.character)` and this will be clear. — Ken Benoit, Apr 20 '16 at 14:05
@KenBenoit you are right. Now I understand my mistake. Guys, I apologize for stating the problem incorrectly. I really appreciate your efforts. — Worice, Apr 20 '16 at 18:57

score 1 · Accepted Answer · edited May 23 '17 at 12:16

Actually, the fourth text is the longest once we strip the whitespace padding. Here's how. Note that a lot of this comes from the difficulty of getting texts out of a tm (V)Corpus object, which has been asked (several times) before, for instance here.

Note that I am interpreting your question about "lines" as referring to the five documents, each of which consists of multiple lines (character vectors of length 16 to 18). I hope I have interpreted this correctly.

texts <- sapply(ovid$content, "[[", "content")
str(texts)
## List of 5
## $ : chr [1:16] "    Si quis in hoc artem populo non novit amandi," "         hoc legat et lecto carmine doctus amet." "    arte citae veloque rates remoque moventur," "         arte leves currus: arte regendus amor." ...
## $ : chr [1:17] "    quas Hector sensurus erat, poscente magistro" "         verberibus iussas praebuit ille manus." "    Aeacidae Chiron, ego sum praeceptor Amoris:" "         saevus uterque puer, natus uterque dea." ...
## $ : chr [1:17] "    vera canam: coeptis, mater Amoris, ades!" "    este procul, vittae tenues, insigne pudoris," "         quaeque tegis medios, instita longa, pedes." "    nos venerem tutam concessaque furta canemus," ...
## $ : chr [1:17] "    scit bene venator, cervis ubi retia tendat," "         scit bene, qua frendens valle moretur aper;" "    aucupibus noti frutices; qui sustinet hamos," "         novit quae multo pisce natentur aquae:" ...
## $ : chr [1:18] "    mater in Aeneae constitit urbe sui." "    seu caperis primis et adhuc crescentibus annis," "         ante oculos veniet vera puella tuos:" "    sive cupis iuvenem, iuvenes tibi mille placebunt." ...

So here we have extracted the texts, but each document is a character vector with one element per line, and because they are verses, some elements carry variable whitespace padding at the beginning and end. Let's trim these and leave just the text, using stringi's stri_trim_both function.

# need to trim leading and trailing whitespace
texts <- lapply(texts, stringi::stri_trim_both)
## texts[1]
## [[1]]
## [1] "Si quis in hoc artem populo non novit amandi,"     "hoc legat et lecto carmine doctus amet."          
## [3] "arte citae veloque rates remoque moventur,"        "arte leves currus: arte regendus amor."           
## [5] ""                                                  "curribus Automedon lentisque erat aptus habenis," 
## [7] "Tiphys in Haemonia puppe magister erat:"           "me Venus artificem tenero praefecit Amori;"       
## [9] "Tiphys et Automedon dicar Amoris ego."             "ille quidem ferus est et qui mihi saepe repugnet:"
## [11] ""                                                  "sed puer est, aetas mollis et apta regi."         
## [13] "Phillyrides puerum cithara perfecit Achillem,"     "atque animos placida contudit arte feros."        
## [15] "qui totiens socios, totiens exterruit hostes,"     "creditur annosum pertimuisse senem."              
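
If stringi is not installed, base R's trimws() should do the same job for ordinary spaces and tabs. The sketch below is only an illustration: `toy_texts` is made-up stand-in data, not the object extracted from the corpus above.

```r
# Hypothetical stand-in for the list of character vectors extracted above
toy_texts <- list(
  c("    Si quis in hoc artem populo non novit amandi,",
    "         hoc legat et lecto carmine doctus amet.",
    "    "),
  c("    vera canam: coeptis, mater Amoris, ades!  ")
)

# trimws() is base R; which = "both" strips leading and trailing whitespace
toy_trimmed <- lapply(toy_texts, trimws, which = "both")

# A whitespace-only element becomes the empty string
print(toy_trimmed[[1]])
```

Note that trimws() by default only strips spaces, tabs, and line breaks, so the two functions can differ on more exotic Unicode whitespace.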

# now paste them together to make a single character vector of the five documents
texts <- sapply(texts, paste, collapse = "\n")
str(texts)
##  chr [1:5] "Si quis in hoc artem populo non novit amandi,\nhoc legat et lecto carmine doctus amet.\narte citae veloque rates remoque movent"| __truncated__ ...
cat(texts[1])
## Si quis in hoc artem populo non novit amandi,
## hoc legat et lecto carmine doctus amet.
## arte citae veloque rates remoque moventur,
## arte leves currus: arte regendus amor.
## 
## curribus Automedon lentisque erat aptus habenis,
## Tiphys in Haemonia puppe magister erat:
## me Venus artificem tenero praefecit Amori;
## Tiphys et Automedon dicar Amoris ego.
## ille quidem ferus est et qui mihi saepe repugnet:
##     
## sed puer est, aetas mollis et apta regi.
## Phillyrides puerum cithara perfecit Achillem,
## atque animos placida contudit arte feros.
## qui totiens socios, totiens exterruit hostes,
## creditur annosum pertimuisse senem.

That's looking more like it. Now we can figure out which was longest.

nchar(texts)
## [1] 600 621 644 668 622
which.max(nchar(texts))
## [1] 4
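
If what you are after is instead the longest individual line (a single element of one of the character vectors), something along these lines might work. This is a minimal sketch on made-up data: `toy_docs` and the other variable names are hypothetical, and it assumes the lines have already been trimmed. It also shows a per-document character count that does not include the "\n" separators introduced by paste().

```r
# Hypothetical toy data: a list of character vectors, one vector per document
toy_docs <- list(
  c("arma virumque cano", "", "Troiae qui primus ab oris"),
  c("Italiam fato profugus", "Laviniaque venit litora")
)

# Characters per document, summed over its lines (no newline separators counted)
chars_per_doc <- vapply(toy_docs, function(x) sum(nchar(x)), integer(1))

# Length of the longest single line within each document
longest_per_doc <- vapply(toy_docs, function(x) max(nchar(x)), integer(1))

# Locate the overall longest line: which document, then which line in it
doc_idx  <- which.max(longest_per_doc)
line_idx <- which.max(nchar(toy_docs[[doc_idx]]))

print(toy_docs[[doc_idx]][line_idx])
```

In case of ties, which.max() returns the first maximum only, so this reports a single line even if several share the top length.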

You interpreted correctly. It was my fault. I confused the five documents with "lines" of length 16, 18, etc. The real object of the question, hence, is the length of the character vectors composing the five documents. Thank you for making it clear to me. I really appreciate your effort. — Worice, Apr 20 '16 at 19:11
You're welcome. Sounds like an answer worth accepting then :-). — Ken Benoit, Apr 20 '16 at 19:22
Absolutely. With a point too. It helped me with my first heavy step in the world of data mining! — Worice, Apr 20 '16 at 19:28
