
Say I have the data file given below. The awk command that follows splits the file into multiple parts based on the value of the first column and writes each part to its own file.

chr pos idx len
2   23  4   4   
2   25  7   3   
2   29  8   2   
2   35  1   5   
3   37  2   5   
3   39  3   3   
3   41  6   3   
3   45  5   5   
4   25  3   4   
4   32  6   3   
4   38  5   4   

awk 'BEGIN {FS=OFS="\t"} {print > "file_"$1".txt"}' write_multiprocessData.txt

The above code splits the file into file_2.txt, file_3.txt, and so on. Since awk loads the file into memory first, I would rather write a Python script that calls awk to split the file and loads the parts directly into memory (giving each part a unique variable name such as file_1, file_2).
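
For concreteness, the rough shape I am imagining is something like this sketch (hypothetical; it assumes Python 3.7+ for `subprocess.run(..., capture_output=True)`, and here awk only drops the header while Python does the in-memory grouping):

import subprocess
from collections import defaultdict

# Run awk once and capture its stdout in this process instead of
# letting awk write file_<chr>.txt files to disk.
result = subprocess.run(
    ["awk", "NR > 1", "write_multiprocessData.txt"],
    capture_output=True, text=True, check=True,
)

# Group the records by the first column (chr) in memory.
groups = defaultdict(list)
for record in result.stdout.splitlines():
    groups[record.split()[0]].append(record)

# groups["2"], groups["3"], ... now play the role of file_2, file_3, ...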

Would this be possible? If not, what other approaches can I try?

everestial007
  • Why not call awk from python directly and work on the output from awk on stdout in python? – ayushgp Feb 13 '18 at 03:48
  • I meant that I would be calling `awk` from Python, but it looks like the wording threw you off. Can you show me some code? – everestial007 Feb 13 '18 at 03:50
  • Why do you need awk at all? You can split the data in Python. And are you sure that awk loads the whole file into the RAM? – DYZ Feb 13 '18 at 03:50
  • I know I can split the data in Python, but Python is slow for big files. I am thinking I can split the file using awk and then pipe the parts out for multiprocessing, since the data analysis is CPU bound and I want to run the computation for each group of the file (based on the values of the first column). – everestial007 Feb 13 '18 at 03:52
  • You can refer to this answer: https://stackoverflow.com/questions/2502833/store-output-of-subprocess-popen-call-in-a-string – ayushgp Feb 13 '18 at 03:53
  • I just updated my title. I am hoping it makes more sense now. – everestial007 Feb 13 '18 at 04:17
  • Awk does not load the entire file into memory. Its processing model is purely one record at a time, where out of the box (and in your case) a record is a single line. – tripleee Feb 13 '18 at 04:26
  • @Evert: That was my first approach, but I am running into a computational burden and I am not very adept at multithreading/multiprocessing, though I have tried tirelessly to read about it. I have this one particular problem https://stackoverflow.com/questions/48737403/how-to-run-multiprocessing-and-or-multithreading-in-the-given-data-and-python-pr and I thought: what if I bring `awk` into play? Can you look into the problem and suggest something? The two answers there do not address my concern at all. – everestial007 Feb 13 '18 at 04:28
  • @tripleee: Good to know that I had a wrong idea about awk. Can you please look into this problem: https://stackoverflow.com/questions/48737403/how-to-run-multiprocessing-and-or-multithreading-in-the-given-data-and-python-pr I was thinking I could use awk on a similar data problem, but multithreading might be the way to go. I would like to hear how you would propose running the Python analyses for different chromosomes in parallel on multiple cores. – everestial007 Feb 13 '18 at 04:30
  • Hi @Evert: I already have Python code to do my analyses. I made a mock Python script, which is in the link I shared earlier in a comment. But since the problem is CPU bound, I want to run it separately for each `chr` field in parallel (on multiple cores/processes). If you have any ideas, you can add comments on the main question here: https://stackoverflow.com/questions/48737403/how-to-run-multiprocessing-and-or-multithreading-in-the-given-data-and-python-pr – everestial007 Feb 13 '18 at 04:32
  • The file needs to be split by the `chr` field, which is the very first column. – everestial007 Feb 13 '18 at 04:34

2 Answers


I think your awk code has a small bug: the header line also gets written out to a file (the `NR>1` test below skips it). If you want to incorporate your awk command into a Python script that organizes everything you want to do, try this:

import os
from numpy import *

# NR > 1 skips the header line; each record goes to file_<first column>.txt.
os.system("awk '{if(NR>1) print >\"file_\"$1\".txt\"}' test.dat")

`os.system` works very well; however, I did not know it is obsolescent. Anyway, as suggested, `subprocess` works as well:

import subprocess

cmd = "awk '{if(NR>1) print >\"file_\"$1\".txt\"}' test.dat"
p = subprocess.Popen(cmd, stdin=subprocess.PIPE, stdout=subprocess.PIPE, shell=True)
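
On newer Pythons (3.5+), `subprocess.run` is a simpler one-shot call; here is a sketch of the same command, using triple quotes to avoid the backslash escapes:

import subprocess

# run() waits for awk to finish and raises CalledProcessError on failure.
subprocess.run(
    """awk '{if (NR > 1) print > "file_" $1 ".txt"}' test.dat""",
    shell=True, check=True,
)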
cyclomen
  • I am not very good at awk, but my awk code given above has been working fine to produce the files I wanted (split based on the values of the first column, with files named after those values). I will check again to make sure. – everestial007 Feb 13 '18 at 16:33
  • `os.system()` is obsolescent and should not be recommended. Use `subprocess.run()` instead. The `numpy` import is completely gratuitous. Using Python's triple quotes would be vastly preferable to littering the code with backslashes before all double quotes. – tripleee Feb 13 '18 at 16:40
  • @everestial007 I edited in another snippet that calls your (or any) awk command. Is that what you're looking for? – cyclomen Feb 14 '18 at 22:48

There is no need for Awk here.

from collections import defaultdict

prefix = defaultdict(list)
with open('Data.txt', 'r') as data:
    next(data)  # skip the header line (chr pos idx len)
    for line in data:
        line = line.rstrip('\r\n')
        # Group each line under its first field (the "chr" column)
        prefix[line.split()[0]].append(line)

Now the dict prefix holds every distinct first field as a key, and the list of data lines beginning with that field as the value for each key.

If you also wish to write the results into files at this point, that's an easy exercise.
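
For example, a minimal sketch that mirrors the question's file_<key>.txt naming:

# Write each group back out, one file per distinct first field.
for key, lines in prefix.items():
    with open('file_{}.txt'.format(key), 'w') as out:
        out.write('\n'.join(lines) + '\n')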

Generally, simple Awk scripts are nearly always easy and natural to reimplement in Python. Because Awk is very specialized for a constrained set of tasks, the Python code will often be less succinct, but with the Python adage "explicit is better than implicit" in mind, this may actually be a feature from a legibility and maintainability point of view.

tripleee
  • Thanks for the insight. The reason I turned to awk is to solve a multithreading/multiprocessing problem I am having. Have a look at this link: https://stackoverflow.com/questions/48737403/how-to-run-multiprocessing-and-or-multithreading-in-the-given-data-and-python-pr Given that my Python analysis is CPU bound, I want to run the computation for different `chr` values in parallel. Can you add something so I can get a grasp of how to solve my issue? – everestial007 Feb 13 '18 at 04:39
  • This reminds me of doing a bulkload on a massively parallel database system. You can have one process read the data and pass the uncategorized lines out to n other processes. Then have each of those n processes categorize and append the sorted data to the thread-safe queue for a given category - one threadsafe queue per category. This should be fast on a system with lots of cores. AFAIK, awk is good for single-core tasks, but not ideal for multicore. – dstromberg Feb 13 '18 at 04:54
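
To make the parallelism discussed in these comments concrete, here is a minimal sketch of a per-`chr` fan-out with `multiprocessing.Pool`; `process_group` is a hypothetical stand-in for the actual CPU-bound analysis:

from collections import defaultdict
from multiprocessing import Pool

def process_group(args):
    # Hypothetical stand-in for the real CPU-bound per-chromosome analysis.
    chrom, lines = args
    return chrom, len(lines)

if __name__ == '__main__':
    groups = defaultdict(list)
    with open('write_multiprocessData.txt') as data:
        next(data)  # skip the header line
        for line in data:
            groups[line.split()[0]].append(line.rstrip('\r\n'))

    # One worker process per CPU core; each chr group runs in parallel.
    with Pool() as pool:
        for chrom, result in pool.map(process_group, list(groups.items())):
            print(chrom, result)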