
Introduction

I am trying to populate a data table from a large string that I obtained via web scraping. My plan is to break the large chunk of text into smaller pieces, using a certain pattern as the delimiter, and then build from those pieces the variables that go into the columns of the data table. For context: I would like to learn how the members of the lower house of the Brazilian Congress voted on each bill that was considered.

Sample result

Each part was supposed to look like this:

SESSÃO ORDINÁRIA Nº 008 - 10/02/2015\n\t\t\t\t\t\t\t\t\t\t\n\t\t\t\t\t\t\t\t\t\t<td class="borderPrint">\n\t\t\t\t\t\t\t\t\t\t\tPresente <sup style="font-size:10px">\n\t\t\t\t\t\t\t\t\t\t\n\t\t\t\t\t\t\t\t\t\t<td class="borderPrint"> \n\t\t\t\t\t\t\t\t\t\t<td class="borderPrint" align="center">\n\t\t\t\t\t\t\t\t\t\t\t---\n\t\t\t\t\t\t\t\t\t\t\n\t\t\t\t\t\t\t\t\t\n\t\t\t\t\t\t\t\t\n\t\t\t\t\t\t\t\t\n\t\t\t\t\t\t\t\t\t \n\t\t\t\t\t\t\t\t\t\n\t\t\t\t\t\t\t\t\t\n\t\t\t\t\t\t\t\t\t\t<a href="http://www.camara.leg.br/sileg/Prop_Lista.asp?Sigla=PL&Numero=7735&Ano=2014"> PL Nº 7735/2014\n\t\t\t\t\t\t\t\t\t - DVS - PRB - EMENDA Nº 193\n\t\t\t\t\t\t\t\t\t\n\t\t\t\t\t\t\t\t\t\n\t\t\t\t\t\t\t\t\t\n\t\t\t\t\t\t\t\t\t\n\t\t\t\t\t\t\t\t\t\n\t\t\t\t\t\t\t\t\t\t <sup style="font-size:10px">\n\t\t\t\t\t\t\t\t\t\n\t\t\t\t\t\t\t\t\t\n\t\t\t\t\t\t\t\t\t\tSim \n\t\t\t\t\t\t\t\t\t\n\t\t\t\t\t\t\t\t\t\n\t\t\t\t\t\t\t\t\t\t\n\t\t\t\t\t\t\t\t\t\n\t\t\t\t\t\t\t\t\n\t\t\t\t\t\t\t\t\n\t\t\t\t\t\t\t\t\t<tr class="even">\n\t\t\t\t\t\t\t\t\t\t<td class="borderPrint">\n\t\t\t\t\t\t\t\t\t\t\t10/02/2015\n\t\t\t\t\t\t\t\t\t\t\n\t\t\t\t\t\t\t\t\t\t<td class="borderPrint">\n\t\t\t\t\t\t\t\t\t\t\t

To make sense of the pattern I passed to str_extract_all (shown below), note that the big string continues like this:

SESSÃO EXTRAORDINÁRIA Nº 009 - 10/02/2015\n ...

Method

The code was supposed to extract the text between two occurrences of "SESSÃO" using str_extract_all(html, "SESSÃO.*?(?=SESSÃO)") (html is the large string). However, when I ran the code exactly like that, R returned an empty list.

I know that the line breaks (\n) are causing the problem, because I was able to get a result very similar to the one shown above after deleting them from the main text: I applied str_replace_all to html with "\n" as the pattern and "" as the replacement, called the result html1, and then ran str_extract_all on html1 instead of html with the same pattern.
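For reference, the behaviour can be reproduced on a short made-up string. This is only a sketch: the html value below is a hypothetical stand-in for the scraped page, not the real data.

```r
library(stringr)

# Hypothetical stand-in for the scraped text: two sessions separated by line breaks.
html <- "SESSÃO ORDINÁRIA Nº 008\n\tSim\n\tSESSÃO EXTRAORDINÁRIA Nº 009\n\tSim\n"

pattern <- "SESSÃO.*?(?=SESSÃO)"

# By default "." does not match "\n", so no match can span a line break.
res_raw <- str_extract_all(html, pattern)

# Workaround described above: strip the line breaks first, then extract.
html1 <- str_replace_all(html, "\n", "")
res_stripped <- str_extract_all(html1, pattern)

length(res_raw[[1]])       # no matches on the original string
length(res_stripped[[1]])  # one match once the "\n" characters are gone
```

Note that even on html1 the last session is not returned, because no further "SESSÃO" follows it to satisfy the lookahead.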

Question

So my question is: can I tell str_extract_all to ignore \n? If not, is there another way to approach this problem? I would rather not delete the \n's, since they may come in handy when breaking the smaller pieces of the string down further.

Additional sample string

As requested by andrew_reece, this is an expanded version of the sample string:

SESSÃO ORDINÁRIA Nº 008 - 10/02/2015\n\t\t\t\t\t\t\t\t\t\t\n\t\t\t\t\t\t\t\t\t\t<td class="borderPrint">\n\t\t\t\t\t\t\t\t\t\t\tPresente <sup style="font-size:10px">\n\t\t\t\t\t\t\t\t\t\t\n\t\t\t\t\t\t\t\t\t\t<td class="borderPrint"> \n\t\t\t\t\t\t\t\t\t\t<td class="borderPrint" align="center">\n\t\t\t\t\t\t\t\t\t\t\t---\n\t\t\t\t\t\t\t\t\t\t\n\t\t\t\t\t\t\t\t\t\n\t\t\t\t\t\t\t\t\n\t\t\t\t\t\t\t\t\n\t\t\t\t\t\t\t\t\t \n\t\t\t\t\t\t\t\t\t\n\t\t\t\t\t\t\t\t\t\n\t\t\t\t\t\t\t\t\t\t<a href="http://www.camara.leg.br/sileg/Prop_Lista.asp?Sigla=PL&Numero=7735&Ano=2014"> PL Nº 7735/2014\n\t\t\t\t\t\t\t\t\t - DVS - PRB - EMENDA Nº 193\n\t\t\t\t\t\t\t\t\t\n\t\t\t\t\t\t\t\t\t\n\t\t\t\t\t\t\t\t\t\n\t\t\t\t\t\t\t\t\t\n\t\t\t\t\t\t\t\t\t\n\t\t\t\t\t\t\t\t\t\t <sup style="font-size:10px">\n\t\t\t\t\t\t\t\t\t\n\t\t\t\t\t\t\t\t\t\n\t\t\t\t\t\t\t\t\t\tSim \n\t\t\t\t\t\t\t\t\t\n\t\t\t\t\t\t\t\t\t\n\t\t\t\t\t\t\t\t\t\t\n\t\t\t\t\t\t\t\t\t\n\t\t\t\t\t\t\t\t\n\t\t\t\t\t\t\t\t\n\t\t\t\t\t\t\t\t\t<tr class="even">\n\t\t\t\t\t\t\t\t\t\t<td class="borderPrint">\n\t\t\t\t\t\t\t\t\t\t\t10/02/2015\n\t\t\t\t\t\t\t\t\t\t\n\t\t\t\t\t\t\t\t\t\t<td class="borderPrint">\n\t\t\t\t\t\t\t\t\t\t\tSESSÃO EXTRAORDINÁRIA Nº 009 - 10/02/2015\n\t\t\t\t\t\t\t\t\t\t\n\t\t\t\t\t\t\t\t\t\t<td class="borderPrint">\n\t\t\t\t\t\t\t\t\t\t\tPresente <sup style="font-size:10px">\n\t\t\t\t\t\t\t\t\t\t\n\t\t\t\t\t\t\t\t\t\t<td class="borderPrint"> \n\t\t\t\t\t\t\t\t\t\t<td class="borderPrint" align="center">\n\t\t\t\t\t\t\t\t\t\t\t---\n\t\t\t\t\t\t\t\t\t\t\n\t\t\t\t\t\t\t\t\t\n\t\t\t\t\t\t\t\t\n\t\t\t\t\t\t\t\t\n\t\t\t\t\t\t\t\t\t \n\t\t\t\t\t\t\t\t\t\n\t\t\t\t\t\t\t\t\t\n\t\t\t\t\t\t\t\t\t\t<a href="http://www.camara.leg.br/sileg/Prop_Lista.asp?Sigla=PEC&Numero=358&Ano=2013"> PEC Nº 358/2013\n\t\t\t\t\t\t\t\t\t - PROPOSTA DE EMENDA À CONSTITUIÇÃO - 2º TURNO\n\t\t\t\t\t\t\t\t\t\n\t\t\t\t\t\t\t\t\t\n\t\t\t\t\t\t\t\t\t\n\t\t\t\t\t\t\t\t\t\n\t\t\t\t\t\t\t\t\t\n\t\t\t\t\t\t\t\t\t\t <sup 
style="font-size:10px">\n\t\t\t\t\t\t\t\t\t\n\t\t\t\t\t\t\t\t\t\n\t\t\t\t\t\t\t\t\t\tSim \n\t\t\t\t\t\t\t\t\t\n\t\t\t\t\t\t\t\t\t\n\t\t\t\t\t\t\t\t\t\t\n\t\t\t\t\t\t\t\t\t\n\t\t\t\t\t\t\t\t\n\t\t\t\t\t\t\t\t\n\t\t\t\t\t\t\t\t\n\t\t\t\t\t\t\t\t\t \n\t\t\t\t\t\t\t\t\t\n\t\t\t\t\t\t\t\t\t\n\t\t\t\t\t\t\t\t\t\t<a href="http://www.camara.leg.br/sileg/Prop_Lista.asp?Sigla=PEC&Numero=358&Ano=2013"> PEC Nº 358/2013\n\t\t\t\t\t\t\t\t\t - DVS - PSOL - ART. 2º DA PEC Nº 358/2013 - 2º TURNO\n\t\t\t\t\t\t\t\t\t\n\t\t\t\t\t\t\t\t\t\n\t\t\t\t\t\t\t\t\t\n\t\t\t\t\t\t\t\t\t\n\t\t\t\t\t\t\t\t\t\n\t\t\t\t\t\t\t\t\t\t <sup style="font-size:10px">\n\t\t\t\t\t\t\t\t\t\n\t\t\t\t\t\t\t\t\t\n\t\t\t\t\t\t\t\t\t\tSim \n\t\t\t\t\t\t\t\t\t\n\t\t\t\t\t\t\t\t\t\n\t\t\t\t\t\t\t\t\t\t\n\t\t\t\t\t\t\t\t\t\n\t\t\t\t\t\t\t\t\n\t\t\t\t\t\t\t\t\n\t\t\t\t\t\t\t\t\t<tr class="even">\n\t\t\t\t\t\t\t\t\t\t<td class="borderPrint">\n\t\t\t\t\t\t\t\t\t\t\t11/02/2015\n\t\t\t\t\t\t\t\t\t\t\n\t\t\t\t\t\t\t\t\t\t<td class="borderPrint">\n\t\t\t\t\t\t\t\t\t\t\t

Desired result

List with the following elements:

[1]

SESSÃO ORDINÁRIA Nº 008 - 10/02/2015\n\t\t\t\t\t\t\t\t\t\t\n\t\t\t\t\t\t\t\t\t\t<td class="borderPrint">\n\t\t\t\t\t\t\t\t\t\t\tPresente <sup style="font-size:10px">\n\t\t\t\t\t\t\t\t\t\t\n\t\t\t\t\t\t\t\t\t\t<td class="borderPrint"> \n\t\t\t\t\t\t\t\t\t\t<td class="borderPrint" align="center">\n\t\t\t\t\t\t\t\t\t\t\t---\n\t\t\t\t\t\t\t\t\t\t\n\t\t\t\t\t\t\t\t\t\n\t\t\t\t\t\t\t\t\n\t\t\t\t\t\t\t\t\n\t\t\t\t\t\t\t\t\t \n\t\t\t\t\t\t\t\t\t\n\t\t\t\t\t\t\t\t\t\n\t\t\t\t\t\t\t\t\t\t<a href="http://www.camara.leg.br/sileg/Prop_Lista.asp?Sigla=PL&Numero=7735&Ano=2014"> PL Nº 7735/2014\n\t\t\t\t\t\t\t\t\t - DVS - PRB - EMENDA Nº 193\n\t\t\t\t\t\t\t\t\t\n\t\t\t\t\t\t\t\t\t\n\t\t\t\t\t\t\t\t\t\n\t\t\t\t\t\t\t\t\t\n\t\t\t\t\t\t\t\t\t\n\t\t\t\t\t\t\t\t\t\t <sup style="font-size:10px">\n\t\t\t\t\t\t\t\t\t\n\t\t\t\t\t\t\t\t\t\n\t\t\t\t\t\t\t\t\t\tSim \n\t\t\t\t\t\t\t\t\t\n\t\t\t\t\t\t\t\t\t\n\t\t\t\t\t\t\t\t\t\t\n\t\t\t\t\t\t\t\t\t\n\t\t\t\t\t\t\t\t\n\t\t\t\t\t\t\t\t\n\t\t\t\t\t\t\t\t\t<tr class="even">\n\t\t\t\t\t\t\t\t\t\t<td class="borderPrint">\n\t\t\t\t\t\t\t\t\t\t\t10/02/2015\n\t\t\t\t\t\t\t\t\t\t\n\t\t\t\t\t\t\t\t\t\t<td class="borderPrint">\n\t\t\t\t\t\t\t\t\t\t\t

[2]

SESSÃO EXTRAORDINÁRIA Nº 009 - 10/02/2015\n\t\t\t\t\t\t\t\t\t\t\n\t\t\t\t\t\t\t\t\t\t<td class="borderPrint">\n\t\t\t\t\t\t\t\t\t\t\tPresente <sup style="font-size:10px">\n\t\t\t\t\t\t\t\t\t\t\n\t\t\t\t\t\t\t\t\t\t<td class="borderPrint"> \n\t\t\t\t\t\t\t\t\t\t<td class="borderPrint" align="center">\n\t\t\t\t\t\t\t\t\t\t\t---\n\t\t\t\t\t\t\t\t\t\t\n\t\t\t\t\t\t\t\t\t\n\t\t\t\t\t\t\t\t\n\t\t\t\t\t\t\t\t\n\t\t\t\t\t\t\t\t\t \n\t\t\t\t\t\t\t\t\t\n\t\t\t\t\t\t\t\t\t\n\t\t\t\t\t\t\t\t\t\t<a href="http://www.camara.leg.br/sileg/Prop_Lista.asp?Sigla=PEC&Numero=358&Ano=2013"> PEC Nº 358/2013\n\t\t\t\t\t\t\t\t\t - PROPOSTA DE EMENDA À CONSTITUIÇÃO - 2º TURNO\n\t\t\t\t\t\t\t\t\t\n\t\t\t\t\t\t\t\t\t\n\t\t\t\t\t\t\t\t\t\n\t\t\t\t\t\t\t\t\t\n\t\t\t\t\t\t\t\t\t\n\t\t\t\t\t\t\t\t\t\t <sup style="font-size:10px">\n\t\t\t\t\t\t\t\t\t\n\t\t\t\t\t\t\t\t\t\n\t\t\t\t\t\t\t\t\t\tSim \n\t\t\t\t\t\t\t\t\t\n\t\t\t\t\t\t\t\t\t\n\t\t\t\t\t\t\t\t\t\t\n\t\t\t\t\t\t\t\t\t\n\t\t\t\t\t\t\t\t\n\t\t\t\t\t\t\t\t\n\t\t\t\t\t\t\t\t\n\t\t\t\t\t\t\t\t\t \n\t\t\t\t\t\t\t\t\t\n\t\t\t\t\t\t\t\t\t\n\t\t\t\t\t\t\t\t\t\t<a href="http://www.camara.leg.br/sileg/Prop_Lista.asp?Sigla=PEC&Numero=358&Ano=2013"> PEC Nº 358/2013\n\t\t\t\t\t\t\t\t\t - DVS - PSOL - ART. 2º DA PEC Nº 358/2013 - 2º TURNO\n\t\t\t\t\t\t\t\t\t\n\t\t\t\t\t\t\t\t\t\n\t\t\t\t\t\t\t\t\t\n\t\t\t\t\t\t\t\t\t\n\t\t\t\t\t\t\t\t\t\n\t\t\t\t\t\t\t\t\t\t <sup style="font-size:10px">\n\t\t\t\t\t\t\t\t\t\n\t\t\t\t\t\t\t\t\t\n\t\t\t\t\t\t\t\t\t\tSim \n\t\t\t\t\t\t\t\t\t\n\t\t\t\t\t\t\t\t\t\n\t\t\t\t\t\t\t\t\t\t\n\t\t\t\t\t\t\t\t\t\n\t\t\t\t\t\t\t\t\n\t\t\t\t\t\t\t\t\n\t\t\t\t\t\t\t\t\t<tr class="even">\n\t\t\t\t\t\t\t\t\t\t<td class="borderPrint">\n\t\t\t\t\t\t\t\t\t\t\t11/02/2015\n\t\t\t\t\t\t\t\t\t\t\n\t\t\t\t\t\t\t\t\t\t<td class="borderPrint">\n\t\t\t\t\t\t\t\t\t\t\t

Felipe Ito

1 Answer


Without example data this is untested, but I believe what you need is:

str_extract_all(html, regex("SESSÃO.*?(?=SESSÃO|$)", dotall = TRUE))

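Note that I added |$ so that the last group in your data is also matched. The main point, though, is dotall = TRUE: by default . does not match line breaks, so a match can never span a \n; with dotall = TRUE it does, and the \n's stay in the extracted text.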
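A minimal sketch of this on a short made-up string (the html value here is a hypothetical stand-in, not the scraped page). It also shows the inline flag (?s), which should be an equivalent way of switching on the same option inside the pattern itself:

```r
library(stringr)

# Hypothetical stand-in for the scraped text: two sessions separated by line breaks.
html <- "SESSÃO ORDINÁRIA Nº 008\n\tSim\n\tSESSÃO EXTRAORDINÁRIA Nº 009\n\tSim"

# dotall = TRUE lets "." match "\n", so a match can run across lines.
# The lookahead stops each match right before the next "SESSÃO" or at the end.
sessions <- str_extract_all(
  html,
  regex("SESSÃO.*?(?=SESSÃO|$)", dotall = TRUE)
)[[1]]

# Same idea with the inline (?s) flag instead of the regex() wrapper.
sessions_inline <- str_extract_all(html, "(?s)SESSÃO.*?(?=SESSÃO|$)")[[1]]

length(sessions)  # 2: one element per session, line breaks preserved
```

Because the line breaks are kept in each element, they remain available for splitting the pieces further.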

G5W