0

I have a one-column CSV. Each cell has a large text that's the source code from a page in a database of company descriptions (I'm only including a small portion of the text, it's more extensive.) There are about 30,000 cells/companies in the CSV.



structure(list(Company = c("<H5>Company Information</H5>\r\n\r\n</td>\r\n\r\n</tr>\r\n\r\n</table>\r\n\r\n<table>\r\n\r\n<tr>\r\n\r\n<td><span>Company Class:</span></td><td><span>E-Commerce</span></td>\r\n\r\n</tr>\r\n\r\n<tr>\r\n\r\n<td><span>Location:</span></td><td><span>Chicago</span></td>\r\n\r\n</tr>\r\n\r\n<tr>\r\n\r\n<td><span>Company Number:</span></td><td><span>70956655</span></td>\r\n\r\n</tr>\r\n\r\n<tr>\r\n\r\n<td><span>Founding Date:</span></td><td><span>01/09/2015</span></td>\r\n\r\n</tr>\r\n\r\n<tr>\r\n\r\n<td><span>Company Status:</span></td><td><span>Bankrupt</span></td>", 
NA)), row.names = c(NA, -2L), class = c("tbl_df", "tbl", "data.frame"
))

I'd like to extract specific elements from each text (Company Class, Location, Founding Date, Company Status, etc.) and put them in a structured CSV.

There are two issues with the texts. 1. Not all cells have the same text information. For instance, some don't have "Location" (it's not that the element "Location" has an empty value in the text, it's simply not there). If the text doesn't contain "Location" as a possibility, I'd want that field left empty in the extracted CSV. 2. Some texts have the field more than once. For example, some companies have two or more locations. I'd like to extract all of them, the first time a location appears would be "Location" in the CSV and the second "Location2" and so on.

What's the best way to approach this? I don't have much experience working with texts in R so any help would be appreciated.

Edit: To clarify, in the final CSV, I need to create a single row per text with multiple column entries ("Company Name", "Location", etc.). Where the column name is not in the text, I'd like the column cell left blank.

Edit: Here is an example with two locations:

structure(list(Company = c("<H5>Company Information</H5>\r\n\r\n</td>\r\n\r\n</tr>\r\n\r\n</table>\r\n\r\n<table>\r\n\r\n<tr>\r\n\r\n<td><span>Company Class:</span></td><td><span>Banking</span></td>\r\n\r\n</tr>\r\n\r\n<tr>\r\n\r\n<td><span>Location:</span></td><td><span>NYC</span></td>\r\n\r\n</tr>\r\n\r\n<tr>\r\n\r\n<td><span>Location:</span></td><td><span>Atlanta</span></td>\r\n\r\n</tr>\r\n\r\n<tr>\r\n\r\n<td><span>Company Number:</span></td><td><span>43448901</span></td>\r\n\r\n</tr>\r\n\r\n<tr>\r\n\r\n<td><span>Founding Date:</span></td><td><span>07/03/1997</span></td>\r\n\r\n</tr>\r\n\r\n<tr>\r\n\r\n<td><span>Company Status:</span></td><td><span>Solvent</span></td>", 
NA)), row.names = c(NA, -2L), class = c("tbl_df", "tbl", "data.frame"
))
ec23
  • 3
  • 2
  • advise you read this and edit your question: https://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example – Eric Feb 07 '21 at 14:35
  • @ec23 - Do you have an example of how you expect the output to look? – C Jeruzal Feb 07 '21 at 18:55

2 Answers2

0

Since the source code is html you can use rvest package:

Here is an example from your submitted code:

html = "<H5>Company Information</H5>

</td>

</tr>

</table>

<table>

<tr>

<td><span>Company Class:</span></td><td><span>E-Commerce</span></td>

</tr>

<tr>

<td><span>Location:</span></td><td><span>Chicago</span></td>

</tr>

<tr>

<td><span>Company Number:</span></td><td><span>70956655</span></td>

</tr>

<tr>

<td><span>Founding Date:</span></td><td><span>01/09/2015</span></td>

</tr>

<tr>

<td><span>Company Status:</span></td><td><span>Bankrupt</span></td>"

library(rvest)
#> Loading required package: xml2
x = read_html(html)
tble = html_table(x)
tble
#> [[1]]
#>                X1         X2
#> 1  Company Class: E-Commerce
#> 2       Location:    Chicago
#> 3 Company Number:   70956655
#> 4  Founding Date: 01/09/2015
#> 5 Company Status:   Bankrupt

Created on 2021-02-07 by the reprex package (v0.3.0)

After that extracting values from the table that you require should be easy.

monte
  • 1,482
  • 1
  • 10
  • 26
0

Here's an approach based on the package stringr:

library(stringr)
df <- data.frame(
  desc = unlist(str_extract_all(x, "(?<=>)[\\w\\s]+(?=:<)")),
  detail = unlist(str_extract_all(x, "(?<=span>)[\\w\\s/-]+(?!:)(?=</span>)"))
)

Result:

df
            desc     detail
1  Company Class E-Commerce
2       Location    Chicago
3 Company Number   70956655
4  Founding Date 01/09/2015
5 Company Status   Bankrupt

EDIT:

If you want to have the details each in an extra column, then this should work (where the column name is not in the text, you'll get NA):

df <- data.frame(
  Company_Class = str_extract(x, "(?<=Company Class:[<>/a-z]{1,50})[A-Z][\\w\\s-]+"),
  Location = str_extract(x, "(?<=Location:[<>/a-z]{1,50})[A-Z][\\w\\s-]+"),
  Company_Number = str_extract(x, "(?<=Company Number:[<>/a-z]{1,50})\\d+"),
  Founding_Date = str_extract(x, "(?<=Founding Date:[<>/a-z]{1,50})\\d+/\\d+/\\d+"),
  Company_Status = str_extract(x, "(?<=Company Status:[<>/a-z]{1,50})[A-Z][\\w\\s-]+")
)

Result:

df
  Company_Class Location Company_Number Founding_Date Company_Status
1    E-Commerce  Chicago       70956655    01/09/2015       Bankrupt

EDIT 2:

If you may have two Location values:

df <- data.frame(
  Company_Class = str_extract(x, "(?<=Company Class:[<>/a-z]{1,50})[A-Z][\\w\\s-]+"),
  Location1 = unlist(str_extract_all(x, "(?<=Location:[<>/a-z]{1,50})[A-Z][\\w\\s-]+"))[1],
  Location2 = unlist(str_extract_all(x, "(?<=Location:[<>/a-z]{1,50})[A-Z][\\w\\s-]+"))[2],
  Company_Number = str_extract(x, "(?<=Company Number:[<>/a-z]{1,50})\\d+"),
  Founding_Date = str_extract(x, "(?<=Founding Date:[<>/a-z]{1,50})\\d+/\\d+/\\d+"),
  Company_Status = str_extract(x, "(?<=Company Status:[<>/a-z]{1,50})[A-Z][\\w\\s-]+")
)

Result:

df
  Company_Class Location1 Location2 Company_Number Founding_Date Company_Status
1       Banking       NYC   Atlanta       43448901    07/03/1997        Solvent

Data:

x <- structure(list(Company = c("<H5>Company Information</H5>\r\n\r\n</td>\r\n\r\n</tr>\r\n\r\n</table>\r\n\r\n<table>\r\n\r\n<tr>\r\n\r\n<td><span>Company Class:</span></td><td><span>E-Commerce</span></td>\r\n\r\n</tr>\r\n\r\n<tr>\r\n\r\n<td><span>Location:</span></td><td><span>Chicago</span></td>\r\n\r\n</tr>\r\n\r\n<tr>\r\n\r\n<td><span>Company Number:</span></td><td><span>70956655</span></td>\r\n\r\n</tr>\r\n\r\n<tr>\r\n\r\n<td><span>Founding Date:</span></td><td><span>01/09/2015</span></td>\r\n\r\n</tr>\r\n\r\n<tr>\r\n\r\n<td><span>Company Status:</span></td><td><span>Bankrupt</span></td>", 
                                NA)), row.names = c(NA, -2L), class = c("tbl_df", "tbl", "data.frame"
                                ))
Chris Ruehlemann
  • 20,321
  • 4
  • 12
  • 34
  • Thanks for the help. This extracts the text elements I need, but not in the format I'm hoping for. My question might not have been entirely clear. Since the "desc" (Company Class, Location, etc.) are mostly common for all texts, I don't need to extract those and create multiple rows for each text. I need them to be the column names and I need a single row per text containing all the available elements. Is it best to tell R to extract string after "Company Class" and string after "Location" etc. and fill in the relevant column or leave blank where the column name is not in the text? – ec23 Feb 07 '21 at 15:33
  • This is great. One last question: is there a way for R to identify the first time a name appears? E.g. the first time "Location" appears in the text, the string would be placed in column "Location1", the second time, it would be placed in "Location2". – ec23 Feb 07 '21 at 16:55
  • Are you suggesting that there may be two "Location" `detail`s in the same string? If so, can you provide an example string? – Chris Ruehlemann Feb 07 '21 at 16:57
  • Just added an example. – ec23 Feb 07 '21 at 17:29
  • Have edited the answer yet again. Please consider upvoting and/or accepting the answer. – Chris Ruehlemann Feb 07 '21 at 17:44