I have a text file that I'm extracting text from using its punctuation and indentation patterns. The output should be a list of lists, each pairing a company name with its description:

[[company,description],[company,description]]

To do that, I'm running a while loop nested within a for loop to extract the description for each company. Here's my code:

for line in file:
    if not re.search(r"            ", line, re.MULTILINE):
        name = line.split(',', 1)[0]
        companies.append(name)
        print(companies)
        companies = []
    while re.search(r"            ", line, re.MULTILINE):
        desc.append(line)
        print(desc)
        desc = []
        break 

Sample from text file:

XYZ Group, a nearly nine-year-old, Copenhagen-based company that has built a dual-purpose platform, providing both accountancy software and a marketplace for small and medium businesses to find accountants, has landed $73 million in growth funding from a single investor, Lugard Road Capital. TechCrunch has more here.

Black Lake, a nearly five-year-old, China-based software platform for factory workers to log their daily tasks and managers to oversee the plant floor, recently raised $77 million in funding, including from Singapore’s sovereign wealth fund Temasek, which led the round, as well as China Renaissance and Lightspeed Venture Partners. The outfit has now raised more than $100 million altogether, including from from GGV...

[Image: actual text file with indentation pattern]

That's the output:

['XYZ Group']

['            company that has built a dual-purpose platform, providing both']
['            accountancy software and a marketplace for small and medium']
['            businesses to find accountants, has landed 73 million in growth funding from a single investor,']
['            Lugard Road Capital TechCrunch has more']
['            here']

['Black Lake']

['            platform for factory workers to log their daily tasks and managers']
['            to oversee the plant floor, recently raised 77 million in funding,']
['            including from Singapore’s sovereign wealth fund Temasek,']
['            which led the round, as well as China']
['            Renaissance and Lightspeed Venture']
['            Partners The outfit has now raised more than 100']
['            million altogether, including from from GGV']
['            Capital, Bertelsmann Asia Investments,']
['            GSR Ventures, ZhenFund']
['            and others TechCrunch has more']
['            here']

The goal is to join the lines collected in the desc list into a single description and pair it with the company name in one list.

Update

I put desc = [] outside of the while loop and I'm getting this:

['XYZ Group']
['            company that has built a dual-purpose platform, providing both']
['            company that has built a dual-purpose platform, providing both', '            accountancy software and a marketplace for small and medium']
['            company that has built a dual-purpose platform, providing both', '            accountancy software and a marketplace for small and medium', '            businesses to find accountants, has landed 73 million in growth funding from a single investor,']
['            company that has built a dual-purpose platform, providing both', '            accountancy software and a marketplace for small and medium', '            businesses to find accountants, has landed 73 million in growth funding from a single investor,', '            Lugard Road Capital TechCrunch has more']
['            company that has built a dual-purpose platform, providing both', '            accountancy software and a marketplace for small and medium', '            businesses to find accountants, has landed 73 million in growth funding from a single investor,', '            Lugard Road Capital TechCrunch has more', '            here']

I only need the last iteration though

Aly Khairy
  • Could you expand the code sample to be reproducible? What is line, flag and please fix the indentation. – Nevus Sep 15 '22 at 02:48
  • @Nevus I removed the flag because it's irrelevant to this code snippet. As for the indentations, that's how the original text file is formatted and that's the pattern I'm trying to follow in the code. I added a picture to my original post for reference – Aly Khairy Sep 15 '22 at 09:10
  • Welcome to Stack Overflow. "I put `desc = []` outside of the while loop and I'm getting this: ... I only need the last iteration though" - okay, so, did you try to check what `desc` contains after this code? Is it correct? Where the code says `print(desc)`, what exactly do you expect this to mean? How many times do you think it will run, and why? What will happen each time it runs? Do you see how this explains the output? – Karl Knechtel Sep 15 '22 at 12:30
  • @KarlKnechtel thanks, Karl! desc is the list where the company description is appended. The loop goes over each line and if it's indented it adds that line to the list, so technically the output is correct, but I only need the last iteration of that list to be printed out and appended to a list of lists [[company],[description]] – Aly Khairy Sep 15 '22 at 12:35
  • Right, so, think about the logic again. If "only the last iteration" should be printed, then what makes more sense: doing the printing inside the loop, or afterward? – Karl Knechtel Sep 15 '22 at 12:36
  • The problem is that I need to clear the desc list at the end of every while loop so it can collect the lines of the next paragraph. It would work if I only had one paragraph to append – Aly Khairy Sep 15 '22 at 12:57
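
The point raised in the comments (append or print once a paragraph is finished, and only then clear the list) can be sketched roughly as follows. This is a minimal sketch, not the asker's code: the function name `group_companies`, the 12-space indent, and the layout of the `sample` lines (an unindented first line holding the company name, followed by indented continuation lines) are assumptions based on the output shown in the question.

```python
INDENT = " " * 12  # assumption: continuation lines start with 12 spaces


def group_companies(lines, indent=INDENT):
    """Return [[name, description], ...] from name lines and indented continuation lines."""
    records = []
    name, desc = None, []
    for raw in lines:
        line = raw.rstrip("\n")
        if not line.strip():
            continue  # skip blank separator lines
        if line.startswith(indent):
            # Indented line: part of the current company's description
            desc.append(line.strip())
        else:
            # Unindented line: a new company starts, so flush the previous one first
            if name is not None:
                records.append([name, " ".join(desc)])
            name, _, rest = line.partition(",")
            name = name.strip()
            # Keep the rest of the first line as the start of the description
            desc = [rest.strip()] if rest.strip() else []
    # Flush the last company after the loop ends
    if name is not None:
        records.append([name, " ".join(desc)])
    return records


# Assumed layout, pieced together from the sample text and output in the question
sample = [
    "XYZ Group, a nearly nine-year-old, Copenhagen-based",
    "            company that has built a dual-purpose platform, providing both",
    "            accountancy software and a marketplace for small and medium",
    "",
    "Black Lake, a nearly five-year-old, China-based software",
    "            platform for factory workers to log their daily tasks and managers",
]
result = group_companies(sample)
print(result)
```

The list is reset only when a new company line is seen, and the final record is appended after the loop, so each description ends up in the result exactly once. With a real file, passing the open file object as `lines` should work the same way.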

1 Answer

Assuming the text always follows a <company_name>, <description> pattern, a very simple approach can be based on .split(). Split on the first comma by limiting the number of splits with maxsplit=1 to get name and full_description, which can be prettified afterwards:

text = "XYZ Group, a nearly nine-year-old, Copenhagen-based company that has built a dual-purpose platform, providing both accountancy software and a marketplace for small and medium businesses to find accountants, has landed $73 million in growth funding from a single investor, Lugard Road Capital. TechCrunch has more here."

name, full_description = text.split(',', 1)
description = [s.strip() for s in full_description.split(',')]

output = [name, description]
print(output)

Output:

['XYZ Group', ['a nearly nine-year-old', 'Copenhagen-based company that has built a dual-purpose platform', 'providing both accountancy software and a marketplace for small and medium businesses to find accountants', 'has landed $73 million in growth funding from a single investor', 'Lugard Road Capital. TechCrunch has more here.']]

Alternatively, you could also use .split(" ") to split on the multiple occurring spaces and ignore any commas.

albert
  • Thanks, Albert! Unfortunately, that only works with one company and description. If we add a second company and description to that text string, the company name is not recognized separately. – Aly Khairy Sep 15 '22 at 13:08
  • @AlyKhairy: If each company/description block is separated by an empty line you could get each block by doing some preprocessing with `.split("\n\n")` and use the approach from my code snippet for each of these company/description blocks – albert Sep 15 '22 at 13:14
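
The preprocessing suggested in the last comment might look something like this. It is only a sketch under the assumption that blocks are separated by blank lines; `parse_blocks` is a hypothetical helper name and the text in `text` is made-up placeholder data, not taken from the file.

```python
def parse_blocks(text):
    """Split text into blank-line-separated blocks and return [[name, description], ...]."""
    records = []
    for block in text.strip().split("\n\n"):
        if not block.strip():
            continue  # extra blank lines produce empty blocks; skip them
        # Collapse line breaks and indentation inside the block into single spaces
        flat = " ".join(block.split())
        # Split on the first comma only: name before it, description after it
        name, _, description = flat.partition(",")
        records.append([name.strip(), description.strip()])
    return records


# Hypothetical placeholder text with one blank line between blocks
text = (
    "Company A, a first description,\n"
    "            wrapped over two lines.\n"
    "\n"
    "Company B, a second description."
)
pairs = parse_blocks(text)
print(pairs)
```

Here the description is kept as one string rather than split further on commas; applying `[s.strip() for s in description.split(',')]` from the answer to each description would give the nested-list form instead.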