I have a problem reading some of the HTML subpages. Most of them work just fine, but some, e.g. http://www-history.mcs.st-andrews.ac.uk/Biographies/De_Morgan.html, have empty lines in H1 and H3. Because of that, my data.frame is a total mess for those people, e.g.: data frame example. The frame contains 4 columns: "Name", "Date and place of birth", "Date and place of death", "Link". I'm supposed to make a table in LaTeX, but because of those whitespace-only rows my table gets shifted at some points, so a person's name ends up in the date-of-birth column, and so on. To read the sites I simply use a loop from j=1 to length(LinkiWlasciwy):

matematyk=LinkWlasciwy[j] %>% read_html() %>% html_nodes(selektor1) %>% html_text()

where selektor1="h3 font , h1". After that I save the contents to a .txt file and read it in another script, where I am supposed to build a .tex file from this data. In my opinion it would be best to just delete the lines in the file that contain only whitespace, such as spaces, \n, etc. In my .txt file I have, for example:

Marie-Sophie Germain| 1 April 1776

in Paris, France| 27 June 1831

in Paris, France|www-history.mcs.st-andrews.ac.uk/Biographies/Germain.html|

As a separator I use " | ". Not all of them are the same; some contain only one space, some two, and so on. All I want is to bring every broken record into this form:

Marie-Sophie Germain| 1 April 1776 in Paris, France| 27 June 1831 in Paris, France|www-history.mcs.st-andrews.ac.uk/Biographies/Germain.html|

I had to delete http:// from the text samples because I don't yet have 10 reputation and they are counted as links.

  • ([^ \t])[ \t]+$ , look at this post http://stackoverflow.com/questions/9532340/how-to-remove-trailing-white-spaces-using-a-regular-expression-without-removing –  Feb 28 '16 at 10:08
  • Thank you very much, I couldn't find that topic earlier. – Karol Kreczman Feb 28 '16 at 10:49

1 Answer

You can use the stringi library:

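library(stringi)

# Sample record, broken up by whitespace-only lines
line <- c("Marie-Sophie Germain| 1 April 1776",
          " ",
          "in Paris, France| 27 June 1831",
          "   ",
          "in Paris, France|www-history.mcs.st-andrews.ac.uk/Biographies/Germain.html|")

# Keep only the lines that do not consist solely of spaces and tabs
line2 <- line[stri_count_regex(line, "^[ \\t]+$") == 0]
line2

# Glue the remaining pieces into a single record
stri_paste(line2, collapse = "")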

Result:

[1] "Marie-Sophie Germain| 1 April 1776in Paris, France| 27 June 1831in Paris, France|www-history.mcs.st-andrews.ac.uk/Biographies/Germain.html|"
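Note that the result above glues the pieces together without a space ("1776in Paris"), and the pattern `^[ \t]+$` would not match a completely empty line. A possible variant, given only as a sketch in base R: it assumes that a single space between the joined pieces is what is wanted, and the empty string `""` in the sample is an added assumption, not something taken from the question's file.

```r
# Sample lines as in the question; the "" entry is an assumed extra case
lines <- c("Marie-Sophie Germain| 1 April 1776",
           " ",
           "in Paris, France| 27 June 1831",
           "",
           "in Paris, France|www-history.mcs.st-andrews.ac.uk/Biographies/Germain.html|")

# Keep only lines that contain at least one non-whitespace character
kept <- lines[!grepl("^\\s*$", lines)]

# Join the pieces with one space, then squeeze any run of whitespace into one space
record <- gsub("\\s+", " ", paste(kept, collapse = " "))
record
```

With the sample above this should give the single-line record asked for in the question, with "1776 in Paris" separated by one space. The same filter could also be applied to the character vector returned by html_text() before writing it to the .txt file.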
bartoszukm