
Given the input below, I would like to separate the URL lines (those that start with [URL, [LINK, or [WEBSITE) from the text. I would like to put every URL, in order, into one list and every block of text into another list.

I would also like to combine each block of text into a single line, so that every link matches its corresponding text. Below is an example.

[URL - https://url1.com]
news_line1 word
news_line2 word word
news_line3 word word word

[LINK - https://url2.com]
headline_line1 letter
headline_line2 letter letter
headline_line3 letter letter letter

[WEBSITE - https://url3.com]
date_line1 sentence
date_line2 sentence sentence
date_line3 sentence sentence sentence

The desired output would be Links:

[URL - https://url1.com]
[LINK - https://url2.com]
[WEBSITE - https://url3.com]

and Text:

news_line1 word news_line2 word word news_line3 word word word
headline_line1 letter headline_line2 letter letter headline_line3 letter letter letter
date_line1 sentence date_line2 sentence sentence date_line3 sentence sentence sentence

The current code I have is:

import sys

inFile = sys.argv[1]

with open(inFile) as f:
    content = f.readlines()

content = [x.strip() for x in content]
url_links = []
sentences = []

for entry in content:
    sentence = ""
    if entry.startswith(("[URL", "[LINK", "[WEBSITE")):
        url_links.append(entry)

    else:
        sentence = sentence + entry

    sentences.append(sentence)

for sentence in sentences:
    print(sentence)

And the current output I get is:


news_line1 word
news_line2 word word
news_line3 word word word


headline_line1 letter
headline_line2 letter letter
headline_line3 letter letter letter


date_line1 sentence
date_line2 sentence sentence
date_line3 sentence sentence sentence

How can I tweak it such that it gives me the correct output?

Again, the desired output is

news_line1 word news_line2 word word news_line3 word word word
headline_line1 letter headline_line2 letter letter headline_line3 letter letter letter
date_line1 sentence date_line2 sentence sentence date_line3 sentence sentence sentence
Joey Joestar
  • In case it's not obvious, this is why we prefer standard data formats like JSON and YAML over poorly-specified ad-hoc formats where you have to write your own parser. – tripleee Apr 21 '21 at 04:39

1 Answer

Instead of concatenating strings into a variable that is reset on every iteration, you can append an empty string to sentences every time you encounter a line starting with "[URL", "[LINK", or "[WEBSITE". Then append every text line to the last element of sentences.

import sys

inFile = sys.argv[1]

with open(inFile) as f:
    content = f.readlines()

content = [x.strip() for x in content]
url_links = []
sentences = []

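for entry in content:
    if entry.startswith(("[URL", "[LINK", "[WEBSITE")):
        # A link line starts a new group: record the link
        # and open an empty sentence for it.
        url_links.append(entry)
        sentences.append("")
    else:
        # Any other line belongs to the most recent link's sentence.
        sentences[-1] += entry

for sentence in sentences:
    print(sentence)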

Here, I am concatenating strings using "+"; however, depending on your requirements and Python version, there may be faster alternatives.

Which is the preferred way to concatenate a string in Python?
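
For illustration, one such alternative is to collect the text fragments for each link in a list and join them once at the end with `" ".join(...)`, which also puts a space between the lines. This is only a minimal sketch, not a benchmarked implementation. The function name `split_links_and_text` and the inline `sample` are hypothetical, and it assumes that blank lines are only separators and that any text appearing before the first link can be dropped.

```python
MARKERS = ("[URL", "[LINK", "[WEBSITE")


def split_links_and_text(lines):
    """Return (links, texts), where texts[i] is the joined text that follows links[i]."""
    links = []
    groups = []  # one list of text fragments per link
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # assumption: blank lines are only separators
        if line.startswith(MARKERS):
            links.append(line)
            groups.append([])
        elif groups:
            groups[-1].append(line)
        # Assumption: text before the first link has no link to pair with,
        # so it is silently dropped here.
    texts = [" ".join(parts) for parts in groups]
    return links, texts


# Hypothetical sample input, shortened from the question's example.
sample = """\
[URL - https://url1.com]
news_line1 word
news_line2 word word

[LINK - https://url2.com]
headline_line1 letter
headline_line2 letter letter
"""

links, texts = split_links_and_text(sample.splitlines())
for link, text in zip(links, texts):
    print(link, "->", text)
```

To read from a file as in the question, the open file object can be passed directly, for example `split_links_and_text(f)` inside the `with open(inFile) as f:` block.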

Anubhav Gupta
  • Does your input also contain sentences above the first URL? – Anubhav Gupta Apr 21 '21 at 02:48
  • This works great! However, if you look at the output `news_line1 wordnews_line2 word wordnews_line3 word word word`, you can see that `wordnews` has been run together where there should be a space. The same goes for `wordnews_line3`. – Joey Joestar Apr 21 '21 at 02:48
  • You can make a small change such as `sentences[-1] += " " + entry` to add or remove characters according to your precise requirements. – Anubhav Gupta Apr 21 '21 at 02:50
  • Awesome! Thank you so much! I will accept your answer when it lets me – Joey Joestar Apr 21 '21 at 02:51
  • Glad to learn it helped you! At the most basic level you are just concatenating strings, and there are many ways to do so efficiently, so take a look here: https://stackoverflow.com/questions/12169839/which-is-the-preferred-way-to-concatenate-a-string-in-python – Anubhav Gupta Apr 21 '21 at 02:52