Here is my text file, which consists of a tuple on each line:

(1, 2)
(3, 4)
(5, 6)

What is the most rough yet optimized way to read the file above and generate a list structured like this:

[[1,2],[3,4],[5,6]]

Here is my current approach, which is not really what I want:

with open("agentListFile.txt") as f:
    # Yields strings such as '(1, 2)', not lists of integers.
    agentList = [line.rstrip('\n') for line in f]
Jon Clements
User
  • @Alik: Please check the edit for my current method... – User May 15 '15 at 07:41
  • What do you mean "rough" and "optimized"? If you want optimized, provide more details. Details allow people to cheat quite a bit and achieve greater optimization in most cases. Are all the tuples pairs of ints? How big is the file? – Shashank May 15 '15 at 07:49
  • @Shashank: The list consists of about 1000 two-member sublists, and the list comprehension did not perform well enough for me... but Jon's answer using the `literal_eval()` method noticeably improved the performance... – User May 15 '15 at 08:01
  • @Ordenador Are the members all ints or can they be anything? – Shashank May 15 '15 at 08:09
  • @Shashank: they are just integers in the range of 1 up to 100... – User May 15 '15 at 08:16

3 Answers

You can use ast.literal_eval to safely evaluate each tuple and convert it to a list inside a list comprehension, e.g.:

import ast
with open("agentListFile.txt") as f:
    agent_list = [list(ast.literal_eval(line)) for line in f]

For more information, read the documentation for ast.literal_eval, and this thread.
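
As a minimal sketch of why ast.literal_eval is considered safe here (the sample strings below are made up for illustration, not taken from the file): it accepts Python literals such as tuples, including a trailing newline, but raises ValueError for anything that would require executing code.

```python
import ast

# A line as it would appear in the file, trailing newline included.
pair = ast.literal_eval("(1, 2)\n")
print(pair)        # the parsed tuple
print(list(pair))  # converted to a list

# Non-literal expressions are rejected instead of being executed.
try:
    ast.literal_eval("__import__('os').getcwd()")
    rejected = False
except ValueError:
    rejected = True
print("rejected:", rejected)
```

This safety comes at a cost: every line is parsed as Python source, which is consistent with the slower timings reported for this approach elsewhere in the thread.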

Jon Clements

This is the fastest solution I've been able to come up with so far.

import re

def re_sol1():
    ''' re.findall on whole file w/ capture groups '''
    with open('agentListFile.txt') as f:
        numpairs = [[int(numstr)
            for numstr in numpair]
            for numpair in re.findall(r'(\d+), (\d+)', f.read())]
        return numpairs

It makes use of re.findall and the fact that all values are positive integers. With capture groups in the regular expression, re.findall returns each pair as a tuple of digit strings, which a list comprehension then converts to integers.
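
To illustrate what re.findall hands back when the pattern has two capture groups, here is a small sketch that uses an in-memory string as a stand-in for f.read() (the variable names text and raw_pairs are only for this example):

```python
import re

text = "(1, 2)\n(3, 4)\n(5, 6)\n"  # stand-in for f.read()

# With two capture groups, findall yields one tuple of strings per match.
raw_pairs = re.findall(r'(\d+), (\d+)', text)
print(raw_pairs)

# Convert every captured string to int, one inner list per pair.
numpairs = [[int(numstr) for numstr in numpair] for numpair in raw_pairs]
print(numpairs)
```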

To handle negative integers as well, replace each \d+ in the regular expression with -?\d+, giving r'(-?\d+), (-?\d+)' for the capture-group version.
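
A quick sketch of the negative-integer variant, assuming the same "(a, b)" line format; the input string is hypothetical:

```python
import re

text = "(-1, 2)\n(3, -4)\n"  # hypothetical input with negative values

# An optional minus sign in front of each group keeps the sign in the capture.
signed_pairs = [[int(n) for n in pair]
                for pair in re.findall(r'(-?\d+), (-?\d+)', text)]
print(signed_pairs)
```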


When I run the following code on Python 2.7.6 default version for Linux, it seems to show that re_sol1 is the fastest:

with open('agentListFile.txt', 'w') as f:
    for tup in zip(range(1, 1001), range(1, 1001)):
        f.write('{}\n'.format(tup))

funcs = []
def test(func):
    funcs.append(func)
    return func

import re, ast

@test
def re_sol1():
    ''' re.findall on whole file w/ capture groups '''
    with open('agentListFile.txt') as f:
        numpairs = [[int(numstr)
            for numstr in numpair]
            for numpair in re.findall(r'(\d+), (\d+)', f.read())]
        return numpairs

@test
def re_sol2():
    ''' naive re.findall on whole file '''
    with open('agentListFile.txt') as f:
        nums = [int(numstr) for numstr in re.findall(r'\d+', f.read())]
        numpairs = [nums[i:i+2] for i in range(0, len(nums), 2)]
        return numpairs

@test
def re_sol3():
    ''' re.findall on whole file w/ str.split '''
    with open('agentListFile.txt') as f:
        numpairs = [[int(numstr) 
            for numstr in numpair.split(', ')] 
            for numpair in re.findall(r'\d+, \d+', f.read())]
        return numpairs

@test
def re_sol4():
    ''' re.finditer on whole file '''
    with open('agentListFile.txt') as f:
        match_iterator = re.finditer(r'(\d+), (\d+)', f.read())
        numpairs = [[int(ns) for ns in m.groups()] for m in match_iterator]
        return numpairs

@test
def re_sol5():
    ''' re.match line by line '''
    with open('agentListFile.txt') as f:
        numpairs = [[int(ns) 
            for ns in re.match(r'\((\d+), (\d+)', line).groups()] 
            for line in f]
        return numpairs

@test
def re_sol6():
    ''' re.search line by line '''
    with open('agentListFile.txt') as f:
        numpairs = [[int(ns) 
            for ns in re.search(r'(\d+), (\d+)', line).groups()] 
            for line in f]
        return numpairs

@test
def sss_sol1():
    ''' strip, slice, split line by line '''
    with open("agentListFile.txt") as f:
        agentList = [map(int, line.strip()[1:-1].split(', ')) for line in f]
        return agentList

@test
def ast_sol1():
    ''' ast.literal_eval line by line '''
    with open("agentListFile.txt") as f:
        agent_list = [list(ast.literal_eval(line)) for line in f]
        return agent_list

### Begin tests ###

def all_equal(iterable):
    try:
        iterator = iter(iterable)
        first = next(iterator)
        return all(first == rest for rest in iterator)
    except StopIteration:
        return True

if all_equal(func() for func in funcs):
    from timeit import Timer

    def print_timeit(func, cnfg={'number': 1000}):
        print('{}{}'.format(Timer(func).timeit(**cnfg), func.__doc__))

    for func in funcs:
        print_timeit(func)
else:
    print('At least one of the solutions is incorrect.')

Sample output from a single run:

1.50156712532 re.findall on whole file w/ capture groups 
1.53699707985 naive re.findall on whole file 
1.71362090111 re.findall on whole file w/ str.split 
1.97333717346 re.finditer on whole file 
3.36241197586 re.match line by line 
3.59856200218 re.search line by line 
1.71777415276 strip, slice, split line by line 
12.8218641281 ast.literal_eval line by line 
Shashank
  • @Shashank: Putting aside the point that it is considerably more complicated than the candidate solution and hurts readability, the performance does seem to be better... Thank you... – User May 15 '15 at 08:51
  • @Ordenador Yes you're right, I guess I could make a different version that goes through the file line by line...but that is actually computationally harder for computers due to i/o and string-parsing bottlenecks and I was aiming for performance... – Shashank May 15 '15 at 09:16
  • All in all, I am grateful for your contribution. – User May 15 '15 at 09:44
  • @Ordenador note that you could also re-factor this answer to do line by line and make the matching more explicit, eg: `agent_list = list(([int(i) for i in re.match(r'\((\d+), (\d+)\)', line).groups()] for line in f))` – Jon Clements May 15 '15 at 10:12

The code below relies on the assumption that every line follows the same format: (number1, number2)

def strip_slice_split_solution():
    with open("agentListFile.txt") as f:
        # Python 2: map returns a list. On Python 3, wrap it in list(...).
        agentList = [map(int, line.strip()[1:-1].split(', ')) for line in f]
        return agentList

s[1:-1] omits the first and last characters of s, which here are the opening and closing parentheses.
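
The individual steps can be traced on a single line, as in this sketch. The names inner and parts are only for illustration, and a list comprehension replaces map so the snippet behaves the same on Python 2 and Python 3:

```python
line = "(3, 4)\n"  # one line as read from the file

inner = line.strip()[1:-1]      # drop the newline, then the surrounding parentheses
parts = inner.split(', ')       # split on the comma-plus-space separator
pair = [int(p) for p in parts]  # convert each piece to an integer
print(inner, parts, pair)
```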

I put Shashank's solution (with the import moved out of the function), Jon's solution, and mine into one file and ran a few tests. For the tests I generated a few files with 5000-1000 lines each.

Excerpt from test

In [3]: %timeit re_solution()
100 loops, best of 3: 2.3 ms per loop

In [4]: %timeit strip_slice_split_solution()
100 loops, best of 3: 2.28 ms per loop

In [5]: %timeit ast_solution()
100 loops, best of 3: 14.1 ms per loop

All 3 functions produce the same result:

In [6]: ast_solution() == re_solution() == strip_slice_split_solution()
Out[6]: True
Konstantin
  • Thank you, but I have an `AttributeError` that: `list object has no attribute 'rstrip'`... – User May 15 '15 at 07:48
  • @Ordenador i've copied the code from your question and added a few calls on top of it. Fixed it. – Konstantin May 15 '15 at 07:50
  • Those are strange results because I'm doing `Timer(func).timeit(number=1000)` and I get that my solution is always faster than yours on Python 2.7.6 default for linux. :p – Shashank May 15 '15 at 17:18
  • @Shashank is it significantly faster? – Konstantin May 15 '15 at 17:20
  • Depends on what you consider significant. With `number=10000` I'm able to get 14.86 seconds for `re_solution`, and 17.81 for the `strip_slice_split_solution`. It's not super significant, I suppose, but I definitely don't think my solution is slower. – Shashank May 15 '15 at 17:26
  • @Shashank I increased the number of records as well as the number of loops. It seems that your function is faster by 5%-10% for me. Anyway, I am surprised that OP decided to go with the `ast`-based solution. – Konstantin May 15 '15 at 17:39