I want to clean up the repeated lines of an HTML page in R, and this is what I have ended up with so far. I only want to keep the country names. What is the pattern?

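# Download the page source, one element per line
mypage <- readLines("http://www.worldslongestwebsite.com")
# Keep a local copy of the raw HTML
writeLines(mypage, "Raw Data.txt")
mypage[1:1000]
# Find the lines that mention the visitor entries
grep("currentVisitor", mypage)
mypage[230:1000]
# Collapse the lines of interest into one comma-separated string
text <- toString(mypage[230:1000])
text
# Replace markup punctuation (and any digits right after it) with a space
cleantext <- gsub(pattern = "[\"\\<\\>\\=/,:-][0-9]*", replacement = " ", text)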

Result for a couple of lines:

p class  c  Brazil  p   div                                                    div class  d   p class  a      p  p class  b     PM  p                        p class  c  Albania  p   div                                                    div class  d   p class  a      p  p class  b     PM  p                        p class  c  India  p   div                                                    div class  d   p class  a      p  p class  b     PM  p    
  • Please show us your input which produces the result. And please format your code properly (4 indents at least). – Heri Nov 10 '17 at 18:30
  • It's been awhile since we've been able to bump this answer: https://stackoverflow.com/a/1732454/1531971 –  Nov 10 '17 at 18:33
  • Since you probably still want to do this even though all is (probably) lost, and the Elder Gods have been summoned: https://www.r-bloggers.com/string-functions-in-r/ –  Nov 10 '17 at 18:36
  • What is "I have ended to this" ? and why not try harder showing us code and expected output? – mplungjan Nov 10 '17 at 21:30
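
One possible pattern, as a minimal sketch only: judging from the cleaned output in the question, each country name appears to sit inside markup of the form <p class="c">Brazil</p>. That shape is an assumption reconstructed from the output, not confirmed against the live page. Under that assumption, it may be easier to capture the text between the tags than to strip everything else away. The sample lines and the helper name extract_countries below are hypothetical, for illustration.

```r
# Hypothetical sample lines standing in for mypage[230:1000];
# the real markup on the page may differ from this assumed shape.
sample_lines <- c(
  '<div class="d"><p class="a"></p><p class="b">10:15 PM</p>',
  '<p class="c">Brazil</p></div>',
  '<div class="d"><p class="a"></p><p class="b">10:16 PM</p>',
  '<p class="c">Albania</p></div>'
)

# Hypothetical helper: return the text of every <p class="c"> element.
extract_countries <- function(lines) {
  # Work on a single string so that matches are found across all lines
  html <- paste(lines, collapse = " ")
  # Capture everything up to the next "<" after the assumed opening tag
  pattern <- '<p class="c">([^<]*)</p>'
  # Full matches, e.g. '<p class="c">Brazil</p>'
  hits <- regmatches(html, gregexpr(pattern, html, perl = TRUE))[[1]]
  # Keep only the captured group and drop surrounding whitespace
  trimws(sub(pattern, "\\1", hits, perl = TRUE))
}

countries <- extract_countries(sample_lines)
print(countries)
```

On the real data this would be called as extract_countries(mypage[230:1000]). As the linked answer in the comments warns, regular expressions are fragile against HTML, so if the markup turns out to vary, a proper HTML parser (for example the rvest package) is likely the more robust route.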
