Given the input below, I would like to separate the URL lines (those that start with [URL, [LINK, or [WEBSITE) from the text. I would like to put every URL line, in order, into one list and every block of text into another list.
I would also like to combine the text lines that follow each link into a single line, so that every link matches its corresponding text. Below is an example.
[URL - https://url1.com]
news_line1 word
news_line2 word word
news_line3 word word word
[LINK - https://url2.com]
headline_line1 letter
headline_line2 letter letter
headline_line3 letter letter letter
[WEBSITE - https://url3.com]
date_line1 sentence
date_line2 sentence sentence
date_line3 sentence sentence sentence
The desired output would be Links:
[URL - https://url1.com]
[LINK - https://url2.com]
[WEBSITE - https://url3.com]
and Text:
news_line1 word news_line2 word word news_line3 word word word
headline_line1 letter headline_line2 letter letter headline_line3 letter letter letter
date_line1 sentence date_line2 sentence sentence date_line3 sentence sentence sentence
The current code I have is
import sys

inFile = sys.argv[1]
with open(inFile) as f:
    content = f.readlines()
content = [x.strip() for x in content]

url_links = []
sentences = []
for entry in content:
    sentence = ""
    if entry.startswith(("[URL", "[LINK", "[WEBSITE")):
        url_links.append(entry)
    else:
        sentence = sentence + entry
        sentences.append(sentence)

for sentence in sentences:
    print(sentence)
And the current output I have is
news_line1 word
news_line2 word word
news_line3 word word word
headline_line1 letter
headline_line2 letter letter
headline_line3 letter letter letter
date_line1 sentence
date_line2 sentence sentence
date_line3 sentence sentence sentence
How can I tweak it so that it gives me the desired output?
Again, the desired output is
news_line1 word news_line2 word word news_line3 word word word
headline_line1 letter headline_line2 letter letter headline_line3 letter letter letter
date_line1 sentence date_line2 sentence sentence date_line3 sentence sentence sentence
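One possible approach, as a minimal sketch rather than a definitive fix: start a new group of text lines each time a link line is seen, append every following text line to the most recent group, and join each group with spaces at the end. The function name split_links_and_text and the constant PREFIXES are hypothetical names chosen for illustration, and the sketch assumes that blank lines and any text before the first link can be ignored.

```python
# Prefixes that mark a link line (taken from the question).
PREFIXES = ("[URL", "[LINK", "[WEBSITE")


def split_links_and_text(lines):
    """Return (links, texts), where texts[i] joins the lines following links[i]."""
    links = []
    groups = []  # one list of text lines per link, in the same order as links
    for raw in lines:
        entry = raw.strip()
        if not entry:
            continue  # assumption: blank lines carry no content
        if entry.startswith(PREFIXES):
            links.append(entry)
            groups.append([])  # start a new, empty group for this link
        elif groups:
            groups[-1].append(entry)  # text belongs to the most recent link
        # assumption: text that appears before the first link is dropped
    texts = [" ".join(group) for group in groups]
    return links, texts


sample = """
[URL - https://url1.com]
news_line1 word
news_line2 word word
news_line3 word word word
[LINK - https://url2.com]
headline_line1 letter
headline_line2 letter letter
headline_line3 letter letter letter
[WEBSITE - https://url3.com]
date_line1 sentence
date_line2 sentence sentence
date_line3 sentence sentence sentence
"""

links, texts = split_links_and_text(sample.splitlines())
for link, text in zip(links, texts):
    print(link)
    print(text)
```

The key difference from the loop in the question is that the accumulator is not reset on every line: a new group is created only when a link line appears, so the text lines between two links end up in the same entry. To read from a file instead of the sample string, the open file object can be passed directly, since iterating over it yields lines.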