I have a one-column CSV. Each cell has a large text that's the source code from a page in a database of company descriptions (I'm only including a small portion of the text, it's more extensive.) There are about 30,000 cells/companies in the CSV.
structure(list(Company = c("<H5>Company Information</H5>\r\n\r\n</td>\r\n\r\n</tr>\r\n\r\n</table>\r\n\r\n<table>\r\n\r\n<tr>\r\n\r\n<td><span>Company Class:</span></td><td><span>E-Commerce</span></td>\r\n\r\n</tr>\r\n\r\n<tr>\r\n\r\n<td><span>Location:</span></td><td><span>Chicago</span></td>\r\n\r\n</tr>\r\n\r\n<tr>\r\n\r\n<td><span>Company Number:</span></td><td><span>70956655</span></td>\r\n\r\n</tr>\r\n\r\n<tr>\r\n\r\n<td><span>Founding Date:</span></td><td><span>01/09/2015</span></td>\r\n\r\n</tr>\r\n\r\n<tr>\r\n\r\n<td><span>Company Status:</span></td><td><span>Bankrupt</span></td>",
NA)), row.names = c(NA, -2L), class = c("tbl_df", "tbl", "data.frame"
))
I'd like to extract specific elements from each text (Company Class, Location, Founding Date, Company Status, etc.) and put them in a structured CSV.
There are two issues with the texts. 1. Not all cells have the same text information. For instance, some don't have "Location" (it's not that the element "Location" has an empty value in the text, it's simply not there). If the text doesn't contain "Location" as a possibility, I'd want that field left empty in the extracted CSV. 2. Some texts have the field more than once. For example, some companies have two or more locations. I'd like to extract all of them, the first time a location appears would be "Location" in the CSV and the second "Location2" and so on.
What's the best way to approach this? I don't have much experience working with texts in R so any help would be appreciated.
Edit: To clarify, in the final CSV, I need to create a single row per text with multiple column entries ("Company Name", "Location", etc.). Where the column name is not in the text, I'd like the column cell left blank.
Edit: Here is an example with two locations:
structure(list(Company = c("<H5>Company Information</H5>\r\n\r\n</td>\r\n\r\n</tr>\r\n\r\n</table>\r\n\r\n<table>\r\n\r\n<tr>\r\n\r\n<td><span>Company Class:</span></td><td><span>Banking</span></td>\r\n\r\n</tr>\r\n\r\n<tr>\r\n\r\n<td><span>Location:</span></td><td><span>NYC</span></td>\r\n\r\n</tr>\r\n\r\n<tr>\r\n\r\n<td><span>Location:</span></td><td><span>Atlanta</span></td>\r\n\r\n</tr>\r\n\r\n<tr>\r\n\r\n<td><span>Company Number:</span></td><td><span>43448901</span></td>\r\n\r\n</tr>\r\n\r\n<tr>\r\n\r\n<td><span>Founding Date:</span></td><td><span>07/03/1997</span></td>\r\n\r\n</tr>\r\n\r\n<tr>\r\n\r\n<td><span>Company Status:</span></td><td><span>Solvent</span></td>",
NA)), row.names = c(NA, -2L), class = c("tbl_df", "tbl", "data.frame"
))