3

I'm trying to sort through a file line by line, comparing the beginning of each line with the strings in a list, like so:

for line in lines:
    skip_line = True
    for tag in tags:
        if line.startswith(tag) is False:
            continue
        else:
            skip_line = False
            break
    if skip_line is False:
        pass  # do stuff

While the code works just fine, I'm wondering if there's a neater way to check for this condition. I have looked at any(), but it seems to only let me check whether any of my lines start with a fixed tag (which doesn't eliminate the for loop needed to go through my list of tags).

So, essentially I'm asking this:
Is there a better, sleeker option than using a for loop to iterate over my tags list to check if the current line starts with one of its elements?

As Paradox pointed out in his answer: using a dictionary to look up whether the string exists has O(1) complexity on average and actually makes the entire code look a lot cleaner, while being faster than looping through a list. Like so:

tags = {'ticker':0, 'orderBook':0, 'tradeHistory':0}
for line in lines:
    if line.split('\t')[0] in tags:
        pass  # do stuff
deepbrook
  • You're comparing each of multiple strings to each of multiple strings. You'll have to use loops, or call something that uses loops internally (like `any()`). – TigerhawkT3 Dec 02 '15 at 08:26
  • By any chance, are all the tags the same length? Or are tags delimited from the rest of the line in some way, that would allow you to pick out the part of `line` that _might_ correspond to a tag before going through the list of tags? Alternatively, is it practical to load all the lines into memory at once, or do you really want to iterate through the file line by line? – David Z Dec 02 '15 at 08:27
  • They're separated by tabs, the tags are not the same length and yes - I need to go over them line by line; I'm looking for JSON strings which are marked by one of three tags (ticker, orderBook, or transactionHistory), each JSON string then being packed into separate files as they occur. – deepbrook Dec 02 '15 at 08:36
  • 1
    You may find a set to be a more natural data structure than a dict; https://docs.python.org/3/tutorial/datastructures.html#sets – ymbirtt Dec 02 '15 at 09:05

5 Answers

2

If you're determined to pull this down into a one-liner, you can use a generator expression:

tagged_lines = (line for line in lines if any(line.startswith(tag) for tag in tags))
for line in tagged_lines:
    # Do something with line here 

Of course, how readable this is is a different question.

You've probably seen syntax like [x*x for x in range(10)] before, but by swapping the [] for (), we instead generate each item only when it's asked for.
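A minimal sketch of the lazy behaviour described above. The sample lines and tags are made up for illustration and are not data from the question. The last statement also shows that str.startswith accepts a tuple of prefixes, which is a different technique from the any() call in this answer but selects the same lines:

```python
# Hypothetical sample data, for illustration only.
tags = ["ticker", "orderBook", "tradeHistory"]
lines = [
    'ticker\t{"last": 1}',
    'heartbeat\t{}',
    'orderBook\t{"bids": []}',
]

# Generator expression: no line is tested until we start iterating.
tagged_lines = (line for line in lines if any(line.startswith(tag) for tag in tags))

first = next(tagged_lines)   # evaluates only as far as the first matching line
rest = list(tagged_lines)    # consumes whatever matches remain

# Alternative: str.startswith accepts a tuple of prefixes directly.
tagged_with_tuple = [line for line in lines if line.startswith(tuple(tags))]
```

Note that a generator can only be consumed once, so after the two calls above tagged_lines is exhausted.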

ymbirtt
1

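Instead of iterating over your tags list, you can put all your tags inside a hash map and do a simple lookup like myMap.exists("word"). This would be much faster than iterating through your tags list, since a lookup takes O(1) time on average. In Python this is the dictionary data structure, and the lookup is written "word" in my_dict. Note that this is an exact-match test rather than a prefix test, so the tag has to be isolated from the line first (for example, by splitting on a delimiter). http://progzoo.net/wiki/Python:Hash_Maps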
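A rough sketch of that lookup, assuming (as the asker states in the comments) that the tag is separated from the rest of the line by a tab. The helper name first_field and the sample lines are hypothetical, and a set is used here in place of a dict since only the keys matter:

```python
# Hypothetical sample data; the tag is assumed to be the first tab-separated field.
tags = {"ticker", "orderBook", "tradeHistory"}  # a dict's keys would behave the same way

def first_field(line):
    """Return the text before the first tab (the whole line if there is no tab)."""
    return line.partition("\t")[0]

lines = [
    'ticker\t{"last": 1}',
    'tickerExtra\t{}',          # starts with "ticker" but is not an exact tag
    'tradeHistory\t{"trades": []}',
]

# Hash lookup: one membership test per line, regardless of how many tags there are.
matching = [line for line in lines if first_field(line) in tags]
```

Unlike a startswith check, the second sample line is rejected here because its first field is not exactly equal to any tag.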

paradox
  • I've added an example of how I solved this to my question - this was exactly what I was looking for. If you could, add it to your answer, for readability purposes and such. Posting it under here doesn't read so well. – deepbrook Dec 02 '15 at 08:57
1

In fact, any() will do the job.

Looping over each line:

for line in lines:
    tagged = any(line.startswith(y) for y in tags)

Checking whether any line starts with any tag:

any(any(x.startswith(y) for y in tags) for x in lines)

Filtering the tagged lines:

filter(lambda x: any(x.startswith(y) for y in tags), lines)
TigerhawkT3
Netwave
  • 1
    Should be `tagged = any(line.startswith(y) for y in tags)` and `filter(lambda x: not any(x.startswith(y) for y in tags), lines)` else your code doesn't work. – Akshay Hazari Dec 02 '15 at 09:12
0

This has been asked before. Take a look at this post for more solutions. I would flag this post as a duplicate but I still do not have the reputation.

https://stackoverflow.com/a/10477481/5016492

You'll need to modify the regular expression so that it looks at the start of the line. Something like '^tag' should work for you.
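One way this could look, as a sketch rather than the linked answer's exact code. The tags and sample lines are assumptions. Because re.match only matches at the beginning of the string, the explicit '^' anchor can be left out here:

```python
import re

# Hypothetical sample data, for illustration only.
tags = ["ticker", "orderBook", "tradeHistory"]
lines = [
    'ticker\t{"last": 1}',
    'my ticker\t{}',            # contains a tag, but not at the start
    'orderBook\t{"bids": []}',
]

# Build one alternation pattern; re.escape guards against tags that
# contain regex metacharacters.
pattern = re.compile("|".join(re.escape(tag) for tag in tags))

# pattern.match() anchors at the start of each line.
tagged = [line for line in lines if pattern.match(line)]
```

If pattern.search() were used instead, the pattern would need the form '^(?:tag1|tag2)' to keep the check anchored to the start of the line.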

Community
0

How about a combination of any() and filter(), as in this example:

# use your data here ...
mytags = ('hello', 'world')
mylines = ('hello friend', 'you are great', 'world is cruel')

# keep only the lines that start with one of the tags
result = filter(lambda line: any(line.startswith(tag) for tag in mytags), mylines)
# list() is needed on Python 3, where filter() returns an iterator
print(list(result))
B. Brosda