
Apologies in advance for not knowing how to format things perfectly on this website; the tabs in my examples may not display cleanly.

I need to read the contents of two .txt files, where each line contains a keyword and an integer value delimited by a tab. I want to sum the values of keywords common to both files, then write the keywords to a new text file, ordered from highest to lowest value (with the values listed).

Let's say I have two .txt files.

File 1:

eggs    25
beans    10
peas    30
oranges    15

File 2:

eggs    15
pineapples    45
beans    35
peas    25

My desired output would resemble:

peas    55
beans    45
pineapples    45
eggs    40
oranges    15

In case of a tie between two values, I would like to order the keywords alphabetically.

What is the most efficient and Pythonic way of going about this?

2 Answers


First, read the files, convert them to DataFrames, and then add them.

First text file

import pandas as pd

with open(r'C:\Users\Arpit\Desktop\1.txt', 'r') as rf1:
    lines1 = rf1.readlines()

# Split each line into its keyword and amount fields
df1 = pd.DataFrame(lines1)
data = [df1[0].str.split()[i] for i in range(len(df1))]
df1 = pd.DataFrame(data, columns=['items', 'amount'])

Second text file

with open(r'C:\Users\Arpit\Desktop\2.txt', 'r') as rf2:
    lines2 = rf2.readlines()

df2 = pd.DataFrame(lines2)
data = [df2[0].str.split()[i] for i in range(len(df2))]
df2 = pd.DataFrame(data, columns=['items', 'amount'])

Now you can simply add the data

items = {}

# Sum the amounts of keywords common to both DataFrames
for index1, i in enumerate(df1['items']):
    for index2, j in enumerate(df2['items']):
        if i == j:
            items[i] = int(df1['amount'][index1]) + int(df2['amount'][index2])

# Add keywords that appear only in the first DataFrame
for index, i in enumerate(df1['items']):
    if i not in items:
        items[i] = int(df1['amount'][index])

# Add keywords that appear only in the second DataFrame
for index, j in enumerate(df2['items']):
    if j not in items:
        items[j] = int(df2['amount'][index])

# Finally, build the combined DataFrame
df = pd.DataFrame(list(items.values()), index=list(items.keys())).reset_index()
df.columns = ['items', 'amount']
df
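To produce the ordering the question asks for and write it back out as a tab-separated .txt file, here is a minimal sketch on top of the `df` built above (using pandas' `to_csv`; the `numpy.savetxt()` suggested in the comments below is an alternative; the output path is hypothetical):

# Sort by amount descending, breaking ties alphabetically by keyword
df = df.sort_values(['amount', 'items'], ascending=[False, True])

# Write tab-separated output without the index or a header row
df.to_csv(r'C:\Users\Arpit\Desktop\combined.txt', sep='\t', index=False, header=False)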
Arpit
  • Thank you for the answer. Could you elaborate more specifically on what this code is doing, namely the `enumerate()` function? – Randall Ivan Carson Aug 12 '20 at 13:43
  • `enumerate()` is basically used to get the index along with the value at each iteration. I then build a dictionary: first I take the duplicates, sum their values, and add them to the dictionary; then I check both DataFrames for non-duplicate values and add those as well (see the short demo after these comments). – Arpit Aug 12 '20 at 13:49
  • How could I export this data to a .txt file of the same tab-separated layout? – Randall Ivan Carson Aug 12 '20 at 14:18
  • Use `numpy.savetxt()`. Refer to this link - https://numpy.org/doc/stable/reference/generated/numpy.savetxt.html – Arpit Aug 12 '20 at 14:23
  • This solution seems to only be listing the data for df1. – Randall Ivan Carson Aug 12 '20 at 14:26
  • Let us [continue this discussion in chat](https://chat.stackoverflow.com/rooms/219685/discussion-between-randall-ivan-carson-and-arpit). – Randall Ivan Carson Aug 12 '20 at 14:31
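For illustration (not part of the original answer), a minimal demo of what `enumerate()` yields:

# enumerate() pairs each value with its index as you iterate
for index, item in enumerate(['eggs', 'beans', 'peas']):
    print(index, item)
# 0 eggs
# 1 beans
# 2 peas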

Using the csv module and collections.Counter

Code

import csv
from collections import Counter

with open('f1.txt', 'r') as f1, open('f2.txt', 'r') as f2:
  # shown input has multiple spaces between fields
  reader1 = csv.reader(f1, delimiter=' ', skipinitialspace=True)
  reader2 = csv.reader(f2, delimiter=' ', skipinitialspace=True)

  # Use a dict comprehension to build a dictionary,
  # converting the second value in each row to int
  d1 = {x[0]:int(x[1]) for x in reader1}
  d2 = {x[0]:int(x[1]) for x in reader2}

# Use Counter to add common keys
cnts = Counter(d1) + Counter(d2)

# Sort by value descending, then key alphabetically ascending
result = dict(sorted(cnts.items(), key=lambda kv: (-kv[1], kv[0])))
for k, v in result.items():
  print(k, v)
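As a quick illustration of the `Counter` addition above (not from the original answer): adding two Counters sums the values of shared keys and keeps keys unique to either side, which is exactly the merge the question asks for.

from collections import Counter

a = Counter({'eggs': 25, 'beans': 10})
b = Counter({'eggs': 15, 'pineapples': 45})

# Shared key 'eggs' is summed; 'beans' and 'pineapples' carry over
print(a + b)
# Counter({'pineapples': 45, 'eggs': 40, 'beans': 10})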

Test

File1.txt

eggs    25
beans    10
peas    30
oranges    15

File2.txt

eggs    15
pineapples    45
beans    35
peas    25

Output

peas 55
beans 45
pineapples 45
eggs 40
oranges 15

Update

Code updated based upon the actual data example.

Issues

  • Posted code is multi-space delimited
  • Actual data is tab delimited
  • Many rows in the real data (from the file linked in the comments) are not properly formatted as two-column fields
  • Created a function to go line by line through the data and retrieve only the valid fields
  • Used the data file linked in the comments for files 1 & 2

Code Update

from collections import Counter

def get_data(filenm):
  '''
    Two-column, tab-delimited data
    valid - lines with valid data
    invalid - lines with invalid data (line number, data)
  '''
  valid, invalid = [], []
  with open(filenm, 'r', encoding="utf8") as f:
    for i, line in enumerate(f):
      row = line.rstrip().split('\t')
      if len(row) == 2:
        valid.append(row)  # Valid row of data
      else:
        invalid.append((i, line))

  return valid, invalid

valid1, invalid1 = get_data('agg1.txt')
valid2, invalid2 = get_data('agg2.txt')

# Convert Valid rows to dictionary
d1 = {x[0]:int(x[1]) for x in valid1}
d2 = {x[0]:int(x[1]) for x in valid2}

cnts = Counter(d1) + Counter(d2)
# Sort by value descending, then key alphabetically ascending
result = dict(sorted(cnts.items(), key=lambda kv: (-kv[1], kv[0])))

# Show the first 10 lines of the results
print('First 10 lines of results')
for i, (k, v) in enumerate(result.items()):
  print(k, v)
  if i >= 9:
    break

# Show invalid data (line number and line)
print()
print('Invalid file 1')
print(*invalid1, sep = '\n')
print('Invalid file 2')
print(*invalid2, sep = '\n')

Output Update

First 10 lines of results
。 6397586
を 4450628
《 2948712
》 2948688
「 2295146
」 2294570
… 1843528
だ 1530958
いる 841602
こと 761052

Invalid file 1
(5828, '\t\t\t946\n')
(24158, '133\n')
(24293, '132\n')
(30648, '87\n')
(37889, '58\n')
(46807, '37\n')
(51404, '\t\t\t30\n')
(53151, '27\n')
(54272, '26\n')
(54677, '25\n')
(55962, '24\n')
(57129, '23\n')
(70327, '13\n')
(71287, '12\n')
(73405, '11\n')
(76059, '10\n')
(76214, '10\n')
(82563, '8\n')
(83460, '8\n')
(85801, '7\n')
(88476, '6\n')
(88494, '6\n')
(94354, '5\n')
(94703, '5\n')
(97635, '4\n')
(110152, '3\n')
(110153, '3\n')
(110560, '3\n')
(111046, '3\n')
(117778, '2\n')
(117791, '2\n')
(117795, '\t\uf8f3\t2\n')
(117806, '2\n')
(118312, '2\n')
(119811, '2\n')
(119848, '2\n')
(134106, '1\n')
(134485, '1\n')
(134505, '1\n')
(136092, '1\n')
(136144, '1\n')
(136147, '1\n')
(139521, '1\n')
(139626, '1\n')
(139629, '1\n')
(139645, '1\n')
(139665, '1\n')
(139724, '1\n')
(139877, '1\n')
(139885, '1\n')
(139887, '1\n')
(139897, '1\n')
(139914, '1\n')
(139935, '1\n')
(139936, '1\n')
(139963, '1\n')
(139975, '1\n')
Invalid file 2
(5828, '\t\t\t946\n')
(24158, '133\n')
(24293, '132\n')
(30648, '87\n')
(37889, '58\n')
(46807, '37\n')
(51404, '\t\t\t30\n')
(53151, '27\n')
(54272, '26\n')
(54677, '25\n')
(55962, '24\n')
(57129, '23\n')
(70327, '13\n')
(71287, '12\n')
(73405, '11\n')
(76059, '10\n')
(76214, '10\n')
(82563, '8\n')
(83460, '8\n')
(85801, '7\n')
(88476, '6\n')
(88494, '6\n')
(94354, '5\n')
(94703, '5\n')
(97635, '4\n')
(110152, '3\n')
(110153, '3\n')
(110560, '3\n')
(111046, '3\n')
(117778, '2\n')
(117791, '2\n')
(117795, '\t\uf8f3\t2\n')
(117806, '2\n')
(118312, '2\n')
(119811, '2\n')
(119848, '2\n')
(134106, '1\n')
(134485, '1\n')
(134505, '1\n')
(136092, '1\n')
(136144, '1\n')
(136147, '1\n')
(139521, '1\n')
(139626, '1\n')
(139629, '1\n')
(139645, '1\n')
(139665, '1\n')
(139724, '1\n')
(139877, '1\n')
(139885, '1\n')
(139887, '1\n')
(139897, '1\n')
(139914, '1\n')
(139935, '1\n')
(139936, '1\n')
(139963, '1\n')
(139975, '1\n')
DarrylG
  • How could I use non-ASCII characters in a file with this method? I have both files opened in UTF-8; however, the program still returns a UnicodeDecodeError at the line `d1 = {x[0]:int(x[1]) for x in reader1}`. – Randall Ivan Carson Aug 12 '20 at 15:08
  • @RandallIvanCarson--Will take a look. Could you either provide a link to your data example (with non-ascii) or update the data in [this example](https://repl.it/@DarrylGurganiou/sumtwocsv) i.e. f1.txt and f2.txt? – DarrylG Aug 12 '20 at 15:25
  • The files I am combining are two Japanese word use frequency reports of different sources, so I am apprehensive about providing a link even though they were already freely available, but I will anyway. [Here is one of them.](https://drive.google.com/file/d/1MlWHECKp-v70Zvizj-YcN_Zu0_gG09wd/view?usp=sharing) – Randall Ivan Carson Aug 12 '20 at 16:25
  • @RandallIvanCarson--see my updated answer (basically the additional portion with an update). – DarrylG Aug 12 '20 at 18:21
  • Unfortunately this solution has also returned a UnicodeDecodeError. Perhaps there is a difference between our operating systems or their settings. – Randall Ivan Carson Aug 13 '20 at 04:15
  • @RandallIvanCarson--discovered it worked on Linux but had your error on Windows 10. Anyway, I updated `get_data` and it now works on Windows 10 & Linux. The problem was related to this post: [UnicodeDecodeError: 'charmap' codec can't decode byte X in position Y: character maps to <undefined>](https://stackoverflow.com/questions/9233027/unicodedecodeerror-charmap-codec-cant-decode-byte-x-in-position-y-character) – DarrylG Aug 13 '20 at 07:40
  • @RandallIvanCarson--the issue seems to be that the default encoding on Linux is UTF-8, while it's CP1252 on Windows. Explicitly specifying UTF-8 in the function `get_data` fixed the problem. If interested, see [Processing Text Files in Python 3](http://python-notes.curiousefficiency.org/en/latest/python3/text_file_processing.html#the-binary-option) for further details. – DarrylG Aug 13 '20 at 07:52
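A minimal sketch of the fix described in these comments (file name as in the updated answer): passing the encoding explicitly makes the read independent of the platform default.

# Windows defaults to cp1252, which raises UnicodeDecodeError on
# multi-byte UTF-8 sequences; encoding='utf8' avoids that.
with open('agg1.txt', 'r', encoding='utf8') as f:
    first_line = f.readline()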