11

I have a list of URLs and headers from a newspaper site in my country. As a general example:

x = ['URL1','news1','news2','news3','URL2','news1','news2','URL3','news1']

Each URL element is followed by its corresponding sequence of 'news' elements, which can differ in length. In the example above, URL1 has three corresponding news items and URL3 has only one.

Sometimes a URL has no corresponding "news" element:

y = ['URL4','news1','news2','URL5','URL6','news1']

I can easily find every URL index and the "news" elements of each URL.

My question is: Is it possible to transform this list into a dictionary in which the URL element is the key and the "news" elements are a list/tuple-value?

Expected Output

z = {'URL1':('news1', 'news2', 'news3'),
     'URL2':('news1', 'news2'),
     'URL3':('news1',),
     'URL4':('news1', 'news2'),
     'URL5':(),
     'URL6':('news1',)}

I've seen a similar question in this post, but it doesn't solve my problem.

MrCorote
  • Please include the code you wrote that does not produce the desired output. – dfundako Aug 15 '19 at 16:17
  • It's possible, but there probably isn't anything particularly elegant like `dict(foo(bar(baz(x))))` for some set of functions `foo`, `bar`, and `baz`. – chepner Aug 15 '19 at 16:23
  • Are you generating `x`? If so, there must be a better way to do it. – DeepSpace Aug 15 '19 at 16:24
  • @DeepSpace I'm scraping a website using Selenium and I thought that a list in this form would be easier to work with. But it isn't. – MrCorote Aug 15 '19 at 16:27

5 Answers

12

You can do it like this:

>>> y = ['URL4','news1','news2','URL5','URL6','news1']
>>> result = {}
>>> current_url = None
>>> for entry in y:
...     if entry.startswith('URL'):
...         current_url = entry
...         result[current_url] = ()
...     else:
...         result[current_url] += (entry, )
...         
>>> result
{'URL4': ('news1', 'news2'), 'URL5': (), 'URL6': ('news1',)}
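
If lists are acceptable as values, the same loop can be sketched with `collections.defaultdict`. This is only an illustration: the function name `group_by_url` and the `is_url` parameter are made up here, and the `startswith('URL')` test is an assumption carried over from the example data.

```python
from collections import defaultdict


def group_by_url(entries, is_url=lambda e: e.startswith('URL')):
    """Group each run of news items under the URL that precedes it."""
    result = defaultdict(list)
    current_url = None
    for entry in entries:
        if is_url(entry):
            current_url = entry
            # Touch the key so that URLs without any news still appear.
            result[current_url]
        elif current_url is not None:
            # Items before the first URL are skipped (an assumption).
            result[current_url].append(entry)
    return dict(result)


y = ['URL4', 'news1', 'news2', 'URL5', 'URL6', 'news1']
print(group_by_url(y))
# {'URL4': ['news1', 'news2'], 'URL5': [], 'URL6': ['news1']}
```

Unlike the tuple version above, a URL that appears twice keeps its earlier items here, because the key is never reset.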
ForceBru
  • You can use `defaultdict(list)` to save at least 4 lines. I'm not sure why you opted to use tuples if you knew you needed to add new items. – DeepSpace Aug 15 '19 at 16:35
  • @DeepSpace, the OP wanted tuples, so here they are! I was using lists originally but then edited the code to use tuples. As for `defaultdict` - absolutely; I'm always forgetting about it. – ForceBru Aug 15 '19 at 16:38
  • @DeepSpace I see your point about using tuples. I rearranged ForceBru's answer to fit arrays instead. – MrCorote Aug 15 '19 at 16:44
4

You can use itertools.groupby with a key function to identify a URL:

from itertools import groupby

def _key(url):
    # Replace this body with whatever test identifies a URL in your data.
    return url.startswith("URL")

data = ['URL1','news1','news2','news3','URL2','news1','news2','URL3','news1', 'URL4','news1','news2','URL5','URL6','news1']
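# Alternating runs: [URLs...], [news...], [URLs...], [news...], ...
# (assumes the data starts with a URL and ends with a news item)
new_d = [list(b) for _, b in groupby(data, key=_key)]
# Pair each run of URLs with the run of news that follows it.
grouped = [[new_d[i], tuple(new_d[i+1])] for i in range(0, len(new_d), 2)]
# In each run of URLs, only the last one owns the news; the others get ().
result = dict([i for [*c, a], b in grouped for i in [(i, ()) for i in c]+[(a, b)]])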

Output:

{
 'URL1': ('news1', 'news2', 'news3'), 
 'URL2': ('news1', 'news2'), 
 'URL3': ('news1',), 
 'URL4': ('news1', 'news2'), 
 'URL5': (), 
 'URL6': ('news1',)
}
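
For readers who find the nested comprehension hard to follow, the same `groupby` idea can be spelled out as a loop. This is a sketch rather than a drop-in replacement: `group_news` and its `is_url` parameter are hypothetical names, and it additionally tolerates data that ends with a URL.

```python
from itertools import groupby


def group_news(data, is_url):
    """Map each URL to the tuple of news items that directly follow it."""
    result = {}
    last_url = None
    # groupby yields alternating runs: all-URL runs and all-news runs.
    for flag, group in groupby(data, key=is_url):
        items = list(group)
        if flag:
            # Every URL in the run starts out empty; only the last one
            # can receive the news run that follows.
            for url in items:
                result[url] = ()
            last_url = items[-1]
        elif last_url is not None:
            result[last_url] = tuple(items)
    return result


data = ['URL1', 'news1', 'news2', 'URL2', 'URL3', 'news1']
print(group_news(data, lambda e: e.startswith('URL')))
# {'URL1': ('news1', 'news2'), 'URL2': (), 'URL3': ('news1',)}
```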
Ajax1234
4

You can find the indices of the URL keys in the list, take the elements between each pair of consecutive indices, and assign them to the URL at the first index of the pair.

Like this:

x = ['URL1','news1','news2','news3','URL2','news1','news2','URL3','news1']
urls = [i for i, y in enumerate(x) if 'URL' in y]  # positions of the URL entries
adict = {}
for i in range(0, len(urls)):
    if i == len(urls)-1:
        adict[x[urls[i]]] = x[urls[i]+1:len(x)]
    else:
        adict[x[urls[i]]] = x[urls[i]+1:urls[i+1]]
print(adict)

output:

{'URL1': ['news1', 'news2', 'news3'], 'URL2': ['news1', 'news2'], 'URL3': ['news1']}
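
The index-based idea can also be written more compactly by pairing each URL index with the next one. This is a minimal sketch: `slice_between_urls` and `is_url` are illustrative names, and tuples are used for the values to match the output requested in the question.

```python
def slice_between_urls(items, is_url):
    """Slice out the items between consecutive URL positions."""
    starts = [i for i, e in enumerate(items) if is_url(e)]
    # Each slice ends where the next URL begins; the last one ends at len(items).
    ends = starts[1:] + [len(items)]
    return {items[s]: tuple(items[s + 1:e]) for s, e in zip(starts, ends)}


y = ['URL4', 'news1', 'news2', 'URL5', 'URL6', 'news1']
print(slice_between_urls(y, lambda e: e.startswith('URL')))
# {'URL4': ('news1', 'news2'), 'URL5': (), 'URL6': ('news1',)}
```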
Anna Nevison
  • `if 'URL' in y` would also be `True` for a string like `http://mytinyURL.com`, which is not what you want. – jjramsey Aug 15 '19 at 16:33
  • @jjramsey but that's not in his list. – Anna Nevison Aug 15 '19 at 16:35
  • True, but the items `'news1'`, `'news2'`, etc. are clearly placeholders for items that may contain nearly arbitrary text, including a string containing the characters `'URL'`. – jjramsey Aug 15 '19 at 16:39
  • @jjramsey he can obviously modify it based on his use case. That's like saying this specific code wouldn't work if 'Cat' was one of the URLs-it's arbitrary. the idea is to use the indices and find something unique about the urls to be able to get them. – Anna Nevison Aug 15 '19 at 16:41
  • @jjramsey I should have pointed out that the `news` items don't have `'URL'` in them. – MrCorote Aug 15 '19 at 16:54
  • @PedroHenrique you're fine, you already said in your post that you can easily find the indices of the news and url items so that's what I went off of. – Anna Nevison Aug 15 '19 at 17:00
3

The more-itertools library contains a function split_before() which comes in very handy for this purpose:

import more_itertools as mt
{s[0]: tuple(s[1:]) for s in mt.split_before(x, lambda e: e.startswith('URL'))}

I think this is cleaner than any of the other approaches in answers posted before this one, but it does introduce an external dependency (unless you reimplement the function), which makes it not appropriate for every situation.

If your actual use case involves real URLs or something else, rather than strings of the form URL#, then just replace lambda e: e.startswith('URL') with whatever function you can use to tell the key elements apart from the value elements.
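
To avoid the external dependency, a small stand-in for `split_before()` can be written with the standard library alone. The sketch below is not the library's own code, only a function with similar behaviour for this use case. The `looks_like_url` predicate and the `example.com` data are hypothetical, chosen to show how a real-URL test might be swapped in.

```python
def split_before(iterable, pred):
    """Yield lists of items, starting a new list before each item where pred is true."""
    chunk = []
    for item in iterable:
        if pred(item) and chunk:
            yield chunk
            chunk = []
        chunk.append(item)
    if chunk:
        yield chunk


def looks_like_url(entry):
    # Hypothetical predicate: treat anything with an http(s) scheme as a key.
    return entry.startswith(('http://', 'https://'))


scraped = ['https://example.com/a', 'headline one', 'https://example.com/b']
result = {s[0]: tuple(s[1:]) for s in split_before(scraped, looks_like_url)}
print(result)
# {'https://example.com/a': ('headline one',), 'https://example.com/b': ()}
```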

David Z
2

Another solution using groupby, one-liner:

x = ['URL1','news1','news2','news3','URL2','news1','news2','URL3','news1', 'URL4','news1','news2','URL5','URL6','news1']

from itertools import groupby

out = {k: tuple(v) for _, (k, *v) in groupby(x, lambda k, d={'g':0}: (d.update(g=d['g']+1), d['g']) if k.startswith('URL') else (None, d['g']))}

from pprint import pprint
pprint(out)

Prints:

{'URL1': ('news1', 'news2', 'news3'),
 'URL2': ('news1', 'news2'),
 'URL3': ('news1',),
 'URL4': ('news1', 'news2'),
 'URL5': (),
 'URL6': ('news1',)}
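
The one-liner relies on a mutable default argument as a hidden counter so that every URL starts a new group. A more explicit sketch of the same numbering idea uses `itertools.accumulate` to compute the group ids up front. The names `group_with_counter` and `is_url` are illustrative, and the input is assumed to start with a URL.

```python
from itertools import accumulate, groupby


def group_with_counter(items, is_url):
    """Number the groups with a running count of URLs, then group by that number."""
    items = list(items)
    # The running total goes up by one at every URL, so a URL and the
    # news items after it share the same group id.
    ids = accumulate(1 if is_url(e) else 0 for e in items)
    out = {}
    for _, pairs in groupby(zip(ids, items), key=lambda p: p[0]):
        key, *values = (item for _, item in pairs)
        out[key] = tuple(values)
    return out


x = ['URL1', 'news1', 'news2', 'URL2', 'URL3', 'news1']
print(group_with_counter(x, lambda e: e.startswith('URL')))
# {'URL1': ('news1', 'news2'), 'URL2': (), 'URL3': ('news1',)}
```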
Andrej Kesely