Text file processing in Python

Question

I have the following file:

1    real madrid,barcelona,chelsea,arsenal,cska
2    chelsea,arsenal,milan,napoli,juventus
5    bayern,dortmund,celtic,napoli
7    cska,psg,arsenal,atalanta
9    atletic bilbao,las palmas,milan,barcelona

and I want to produce a new file with this output (where I had the nodes, I now have every team, and the second column lists the nodes that have this team as an attribute):

real madrid    1
barcelona    1,9
chelsea    1,2
arsenal    1,2,7
cska    1,7
milan    2,9
etc...

First, I opened the file and saved each column to a list:

file1 = open("myfile.txt","r")
lines1 = file1.readlines()
nodes1 = []
attrs1 = []


for x in lines1:
    x = x.strip()
    x = x.split('\t')
    nodes1.append(x[0])
    attrs1.append(x[1].split(','))

But now, how can I check the attrs and nodes to produce the output file?

Petr Blahos · Answer 1 · 2018-04-06T18:42:58.033

Better, create a dictionary when reading the file:

line_map = {}
for x in lines1:
    (row_no, teams) = x.strip().split("\t")
    for i in teams.split(","):
        if i not in line_map:
            line_map[i] = set()
        line_map[i].add(row_no)

Now line_map maps each team name to the set of line numbers it appears on. You can easily print that:

for (k, v) in line_map.items():
    print("%s: %s" % (k, ",".join(v)))

if I am not much mistaken...

Edit: append should have been add.

edited Apr 06 '18 at 18:42

answered Apr 06 '18 at 14:27


Only thing I would add is to use a `defaultdict` with string keys and set values to avoid the check whether `i` is in `line_map` already. – Tony Tuttle Apr 06 '18 at 14:30

`line_map[i]` is a set, therefore use `.add()` – radzak Apr 06 '18 at 14:34

zwer · Accepted Answer · 2018-04-06T14:56:03.080

You can create a dictionary to hold your teams and populate it with nodes as you encounter them:

import collections

teams = collections.defaultdict(set)  # initiate each team with a set for nodes
with open("myfile.txt", "r") as f:  # open the file for reading
    for line in f:  # read the file line by line
        row = line.strip().split("\t")  # assuming a tab separator as in your code
        if len(row) < 2:  # skip empty or malformed lines (''.split("\t") gives [''], which is truthy)
            continue
        for team in row[1].split(","):  # split and iterate over each team
            teams[team].add(row[0].strip())  # add a node to the current team

# and you can now print it out:
for team, nodes in teams.items():
    print("{}\t{}".format(team, ",".join(nodes)))

This will yield:

arsenal    2,1,7
atalanta    7
chelsea    2,1
cska    1,7
psg    7
juventus    2
real madrid    1
barcelona    9,1
dortmund    5
celtic    5
napoli    2,5
milan    9,2
las palmas    9
atletic bilbao    9
bayern    5

That is the output for your data. Order is not guaranteed, though, since sets are unordered; you can always apply sorted() to get the entries in the order you want.
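As a minimal sketch of the sorted() suggestion, assuming the node ids are always integers (as in the sample data): the inlined sample string and the `format_rows` helper are hypothetical names introduced only for this illustration.

```python
import collections

# Sample data inlined so the sketch runs on its own
# (assumption: tab-separated, integer node ids, as in the question).
sample = (
    "1\treal madrid,barcelona,chelsea,arsenal,cska\n"
    "2\tchelsea,arsenal,milan,napoli,juventus\n"
    "5\tbayern,dortmund,celtic,napoli\n"
    "7\tcska,psg,arsenal,atalanta\n"
    "9\tatletic bilbao,las palmas,milan,barcelona\n"
)

teams = collections.defaultdict(set)
for line in sample.splitlines():
    row = line.strip().split("\t")
    if len(row) < 2:  # skip blank or malformed lines
        continue
    for team in row[1].split(","):
        teams[team].add(row[0].strip())


def format_rows(teams):
    """Return 'team<TAB>nodes' strings: teams sorted by name, nodes sorted numerically."""
    return [
        "{}\t{}".format(team, ",".join(sorted(nodes, key=int)))
        for team, nodes in sorted(teams.items())
    ]


for row in format_rows(teams):
    print(row)
```

Sorting the nodes with key=int keeps "10" after "9"; a plain string sort would put it before "2".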

UPDATE: To save the result into a file all you need is to use handle.write():

with open("out_file.txt", "w") as f:  # open the file for writing
    for team, nodes in teams.items():  # iterate through the collected team-node pairs
        f.write("{}\t{}\n".format(team, ",".join(nodes)))  # write each as a new line

Your code works fine, but when I try to save it to a text file, it saves everything on one line. How can I save it in the same format as it is printed? – Lee Yaan Apr 06 '18 at 14:49
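The one-line result usually comes from write() not adding a line break on its own, whereas print() does. As a hedged alternative sketch, print() can also target a file. The small `teams` dict and the temporary output path below are stand-ins chosen for the demo, not part of the answer.

```python
import os
import tempfile

# Hypothetical stand-in for the `teams` mapping built earlier.
teams = {"arsenal": {"1", "2", "7"}, "psg": {"7"}}

# Temporary path so the sketch does not touch real files (assumption for the demo).
out_path = os.path.join(tempfile.mkdtemp(), "out_file.txt")

with open(out_path, "w") as f:
    for team, nodes in sorted(teams.items()):
        # print() appends "\n" by default; f.write() writes only what it is given
        print(team, ",".join(sorted(nodes, key=int)), sep="\t", file=f)

# Read the file back to check that each team landed on its own line.
with open(out_path) as f:
    written = f.read()
print(written, end="")
```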

snakes_on_a_keyboard · Answer 3 · 2018-04-06T15:52:47.293

Here's an approach(?) using regular expressions. Happy coding :)

#!/usr/bin/env python3.6
import re, io, itertools

if __name__ == '__main__':
    groups = [re.subn(r'(\d*)\s*(.*)', r'\g<2>|\g<1>', line, 1)[0].strip().split('|')
              for line in io.StringIO(open('f.txt').read())]
    enums = sorted([[word, n] for group, n in groups for word in group.split(',')], key=lambda x: x[0])
    for a, b in itertools.groupby(enums, lambda x: x[0]):
        print(a, ','.join(sorted(map(lambda x: x[1], b), key=int)))

Explanation (of sorts)

#!/usr/bin/env python3.6
import re, io, itertools

if __name__ == '__main__':
    # (\d*) <--- match and capture the leading digits
    # \s* <----- match but don't capture the intervening whitespace
    # (.*) <---- match and capture everything else

    # '\g<2>|\g<1>' <--- swaps the second capture group with the first
    #                      and puts a "|" in between for easy splitting

    # io.StringIO is a great wrapper for a string, makes it easy to process text

    # re.subn is used to perform the regex swapping
    groups = [re.subn(r'(\d*)\s*(.*)', r'\g<2>|\g<1>', line, 1)[0].strip().split('|') for line in io.StringIO(open('f.txt').read())]

    # convert [[place1,place2, 1], [place3,place4, 2] ...] -> [[place1, 1], [place2, 1], [place3, 2], [place4, 2] ...]
    enums = sorted([[word, n] for group, n in groups for word in group.split(',')], key=lambda x: x[0])
    # group together, extract numbers, ...?, profit!
    for a, b in itertools.groupby(enums, lambda x: x[0]):
        print(a, ','.join(sorted(map(lambda x: x[1], b), key=int)))
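To see what the individual steps do, here is a small sketch run on a single line and on a hand-made list of pairs. The sample values are assumptions made up for the illustration; only the regex and the groupby pattern come from the answer.

```python
import itertools
import re

line = "1    real madrid,barcelona\n"  # hypothetical sample line

# re.subn returns (new_string, number_of_substitutions_made)
swapped, count = re.subn(r'(\d*)\s*(.*)', r'\g<2>|\g<1>', line, 1)
parts = swapped.strip().split('|')  # [teams_string, node]
print(parts)

# groupby only merges *adjacent* equal keys, hence the sort by team name first
pairs = sorted(
    [["chelsea", "1"], ["arsenal", "2"], ["chelsea", "2"], ["arsenal", "1"]],
    key=lambda x: x[0],
)
grouped = {
    name: ",".join(sorted((n for _, n in grp), key=int))
    for name, grp in itertools.groupby(pairs, key=lambda x: x[0])
}
print(grouped)
```

The raw-string prefixes (r'...') keep Python from interpreting \d, \s and \g as string escapes before the regex engine sees them.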

Bonus: one line "piss off your coworkers" edition

#!/usr/bin/env python3.6
import io
import itertools
import re

if __name__ == '__main__':
    groups = [[place, lines]
              for a, b in itertools.groupby(sorted([[word, n]
              for line in io.StringIO(open('f.txt').read())
              for group, n in [re.subn(r'(\d*)\s*(.*)', r'\g<2>|\g<1>', line, 1)[0].strip().split('|')]
              for word in group.split(',')], key=lambda x: x[0]), key=lambda x: x[0])
              for place, lines in [[a, ','.join(sorted(map(lambda x: x[1], b), key=int))]]]

    for place, lines in groups:
        print(place, lines)

"Bonus" #2: write output directly to file, piss-off-co-worker-no-life edition v1.2

#!/usr/bin/env python3.6
import io
import itertools
import re

if __name__ == '__main__':
    with open('output.txt', 'w') as f:
        groups = [print(place, lines, file=f)
                  for a, b in itertools.groupby(sorted([[word, n]
                  for line in io.StringIO(open('f.txt').read())
                  for group, n in [re.subn(r'(\d*)\s*(.*)', r'\g<2>|\g<1>', line, 1)[0].strip().split('|')]
                  for word in group.split(',')], key=lambda x: x[0]), key=lambda x: x[0])
                  for place, lines in [[a, ','.join(sorted(map(lambda x: x[1], b), key=int))]]]

"Bonus" #3: terminal-tables-because-I-got-fired-for-pissing-off-my-coworkers-so-I-have-free-time-edition v75.2

Note: requires the third-party terminaltables library

#!/usr/bin/env python3.6
import io
import itertools
import re
import terminaltables

if __name__ == '__main__':
    print(terminaltables.AsciiTable(
        [['Places', 'Line No.'], *[[place, lines]
          for a, b in itertools.groupby(sorted([[word, n]
          for line in io.StringIO(open('f.txt').read())
          for group, n in [re.subn(r'(\d*)\s*(.*)', r'\g<2>|\g<1>', line, 1)[0].strip().split('|')]
          for word in group.split(',')], key=lambda x: x[0]), key=lambda x: x[0])
          for place, lines in [[a, ','.join(sorted(map(lambda x: x[1], b), key=int))]]]]).table)

output

+----------------+----------+
| Places         | Line No. |
+----------------+----------+
| arsenal        | 1,2,7    |
| atalanta       | 7        |
| atletic bilbao | 9        |
| barcelona      | 1,9      |
| bayern         | 5        |
| celtic         | 5        |
| chelsea        | 1,2      |
| cska           | 1,7      |
| dortmund       | 5        |
| juventus       | 2        |
| las palmas     | 9        |
| milan          | 2,9      |
| napoli         | 2,5      |
| psg            | 7        |
| real madrid    | 1        |
+----------------+----------+

handle · Answer 4 · 2018-04-07T08:10:58.253

# for this example, instead of reading a file just include the contents as string ..
file1 = """
1\treal madrid,barcelona,chelsea,arsenal,cska
2\tchelsea,arsenal,milan,napoli,juventus
5\tbayern,dortmund,celtic,napoli
7\tcska,psg,arsenal,atalanta
9\tatletic bilbao,las palmas,milan,barcelona
"""

# .. which can be split into a list (same result as with readlines)
lines1 = file1.strip().split('\n')
print(lines1)

# using separate lists requires handling indexes, so I'd use a dictionary instead
output_dict = {}

# iterate as before
for x in lines1:
    # you can chain the methods, and assign both parts of the line 
    # simultaneously (must be two parts exactly, so one TAB, or there
    # will be an error (Exception))
    node, attrs = x.strip().split('\t')

    # separate the list of clubs
    clubs = attrs.split(',')

    # collect each club in the output ..
    for club in clubs:
        # and with it, a list of the node(s)
        if club in output_dict:
            # add entry to the list for the existing club
            output_dict[club].append(node)
        else:
            # insert the club with a new list containing the first entry
            output_dict[club] = [node]

    # that should be it, let's see ..

# iterate the dict(ionary)
for club in output_dict:
    # convert list of node(s) to a string by joining the elements with a comma
    nodestr = ','.join(output_dict[club])

    # create a formatted string with the club and its nodes
    clubstr = "{:20}\t{}".format(club, nodestr)

    # print to stdout (e.g. console)
    print( clubstr )

prints

['1\treal madrid,barcelona,chelsea,arsenal,cska', '2\tchelsea,arsenal,milan,napoli,juventus', '5\tbayern,dortmund,celtic,napoli', '7\tcska,psg,arsenal,atalanta', '9\tatletic bilbao,las palmas,milan,barcelona']
real madrid             1
barcelona               1,9
chelsea                 1,2
arsenal                 1,2,7
cska                    1,7
milan                   2,9
napoli                  2,5
juventus                2
bayern                  5
dortmund                5
celtic                  5
psg                     7
atalanta                7
atletic bilbao          9
las palmas              9
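If the if/else feels verbose, one possible shortening is dict.setdefault, sketched here on two made-up lines (the `lines` sample is an assumption for the demo; the variable names otherwise follow the answer):

```python
# Hypothetical two-line sample in the same "node<TAB>club,club" format.
lines = [
    "1\tchelsea,arsenal",
    "2\tchelsea,milan",
]

output_dict = {}
for entry in lines:
    node, attrs = entry.strip().split("\t")
    for club in attrs.split(","):
        # setdefault returns the existing list, or stores and returns a new empty one
        output_dict.setdefault(club, []).append(node)

print(output_dict)
```

The behaviour is the same as the explicit membership test; it just folds the "create the list on first sight" branch into one call.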

score 0 · Answer 5 · answered Apr 06 '18 at 17:03

Here is a solution with pandas (why not)

import pandas as pd
path_file_input = r'path\to\input_file.txt'    # raw strings, so that \t is not read as a tab
path_file_output = r'path\to\output_file.txt'

# Read the data from a txt file (with a tab separating the columns)
data = pd.read_csv(path_file_input, sep='\t', header=None, names=['Nodes', 'List Teams'], dtype=str)
# Create one row per team-node pair
data_split = data['List Teams'].str.split(',', expand=True).stack().reset_index(level=0)\
                .set_index('level_0').rename(columns={0:'Teams'}).join(data.drop(columns='List Teams'), how='left')
# Merge the data per team and join the nodes
data_merged = data_split.groupby('Teams')['Nodes'].apply(','.join).reset_index()

# Save as a txt file
data_merged.to_csv(path_file_output, sep='\t', index=False, header=False, float_format = str)
# or display the data
print(data_merged.to_csv(sep='\t', header=False, index=False))

See "normalizing data by duplication" for a really good explanation of the line starting with data_split.
