I want to clean up the repeated lines of an HTML page in R, and this is as far as I have got. I only want to keep the country names. What pattern should I use?
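# Download the page as a character vector, one element per line
mypage <- readLines('http://www.worldslongestwebsite.com')
write(mypage, "Raw Data.txt")
mypage[1:1000]
# Find the lines that contain the visitor entries
grep('currentVisitor', mypage)
mypage[230:1000]
# Collapse the lines of interest into one comma-separated string
text <- toString(mypage[230:1000])
text
# Replace punctuation (and any digits that follow it) with a space
cleantext <- gsub(pattern = "[\"\\<\\>\\=/,:-][0-9]*", replacement = " ", text)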
Result for a couple of lines:
p class c Brazil p div div class d p class a p p class b PM p p class c Albania p div div class d p class a p p class b PM p p class c India p div div class d p class a p p class b PM p
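
Judging from the cleaned output above, each country name seems to sit inside a tag like <p class="c">Brazil</p>. That markup is an assumption reconstructed from the output, not something confirmed from the page source. If it holds, one possible approach is to match that tag directly instead of stripping punctuation. Below is a minimal sketch in base R; the `html` string is a hypothetical sample standing in for `toString(mypage[230:1000])`.

```r
# Hypothetical sample mimicking the assumed markup;
# in practice use: html <- toString(mypage[230:1000])
html <- paste0(
  '<p class="c">Brazil</p></div><div class="d"><p class="a">1</p><p class="b">10:15 PM</p>',
  '<p class="c">Albania</p></div><div class="d"><p class="a">2</p><p class="b">10:16 PM</p>',
  '<p class="c">India</p></div>'
)

# Match <p class="c">...</p> and capture the text between the tags
pattern <- '<p class="c">([^<]*)</p>'

# Extract every full match from the string
matches <- regmatches(html, gregexpr(pattern, html))[[1]]

# Keep only the captured group (the country name) and trim whitespace
countries <- trimws(sub(pattern, "\\1", matches))
countries
```

If the same country appears more than once, `unique(countries)` would keep one copy of each. Should the real page use a different class name or extra attributes on the tag, the pattern would need adjusting to match.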