I have a problem reading some of the HTML subsites. Most of them work just fine, but a few, e.g. http://www-history.mcs.st-andrews.ac.uk/Biographies/De_Morgan.html, have empty lines inside the H1 and H3 elements. Because of that my data.frame is a total mess for those people. The frame contains 4 columns: "Name", "Date and place of birth", "Date and place of death", "Link". I am supposed to make a table in LaTeX, but because of those rows with whitespace my table at some points goes in the wrong direction and a guy's name ends up as his date of birth, and so on.

To read the sites I simply use a loop from j = 1 to length(LinkWlasciwy):
library(rvest)   # also provides the %>% pipe

matematyk <- LinkWlasciwy[j] %>%
  read_html() %>%               # download and parse the page
  html_nodes(selektor1) %>%     # pick out the h1 / h3 font nodes
  html_text()                   # extract their text
where selektor1 = "h3 font , h1". After that I save its contents to a .txt file and read it in another script, where I am supposed to build the .tex file out of this data. In my opinion it would be best to just delete the lines in the file that contain only whitespace (spaces, \n, etc.). In my txt file a broken record looks, for example, like this:
Marie-Sophie Germain| 1 April 1776
in Paris, France| 27 June 1831
in Paris, France|www-history.mcs.st-andrews.ac.uk/Biographies/Germain.html|
As a separator I use " | ", though not all of them are the same: some have only one space around the "|", some two, etc. All I want is to bring every broken record back to this form:
Marie-Sophie Germain| 1 April 1776 in Paris, France| 27 June 1831 in Paris, France|www-history.mcs.st-andrews.ac.uk/Biographies/Germain.html|
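What I had in mind is roughly the sketch below: read the txt file back in, drop the whitespace-only lines, and glue continuation lines onto the current record until it ends with the trailing "|" after the link. It is base R only; the file names are just placeholders, and the "record ends with |" rule is an assumption based on the examples above.

# rough sketch; "matematycy.txt" / "matematycy_clean.txt" are placeholder names
linie <- readLines("matematycy.txt")

# drop lines that contain only whitespace (spaces, \n, tabs)
linie <- linie[trimws(linie) != ""]

# glue continuation lines onto the current record until it ends with "|"
rekordy <- character(0)
biezacy <- ""
for (l in linie) {
  biezacy <- trimws(paste(biezacy, l))
  if (grepl("\\|\\s*$", biezacy)) {                     # record is complete
    rekordy <- c(rekordy, gsub("\\s+", " ", biezacy))   # squeeze inner whitespace
    biezacy <- ""
  }
}

writeLines(rekordy, "matematycy_clean.txt")

If some records could end without the trailing "|", counting the "|" characters accumulated so far (4 in a complete record) would probably be a safer stopping rule.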
I had to delete http:// from the text samples because I don't have 10 reputation yet and they are counted as links.
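Alternatively, maybe it would be cleaner to strip the whitespace right after html_text(), before anything is written to the txt file, so the broken records never appear in the first place. A minimal sketch operating on the matematyk vector from the loop above (trimws() and gsub() are base R):

# collapse internal newlines / runs of spaces inside each scraped piece
matematyk <- gsub("\\s+", " ", trimws(matematyk))
# drop pieces that were only whitespace (the "empty" h1/h3 content)
matematyk <- matematyk[matematyk != ""]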