I have a dataset/dataframe in this format:

gene : ABC
sample: XYX
input:23
.
.
.
gene : DEF
sample: ERT
input :24

.
.

it goes on and on.

How do I get it in this format?

gene sample input
abc   xyx   23
def    ert   24

.
.

Either Python or shell commands will do.

I tried the pandas transpose, but it doesn't give me the result I'm looking for.

Emma
foondar
  • I want the output in csv format, with gene, sample and input as the header and the rest of the information below – foondar Aug 01 '19 at 18:21
  • It's difficult to understand your data input and output. You state you're using a dataframe, so perhaps have a look at [How to create good pandas examples](https://stackoverflow.com/questions/20109391/how-to-make-good-reproducible-pandas-examples) and reformat your sample data so that we can help you better – G. Anderson Aug 01 '19 at 18:24
  • It's a dataset basically - more like a text file – foondar Aug 01 '19 at 19:59

1 Answer

I'm not 100% sure what you're looking for, so I'll give a couple of examples of potential solutions. If these don't match what you need, please update your question or add a comment.

Set up (following your example info):

    import pandas as pd
    dict1 = {"gene": "ABC", "sample": "XYZ", "input": 23}
    dict2 = {"gene": "DEF", "sample": "ERT", "input": 24}
    columns = ["gene", "sample", "input"]
    df = pd.DataFrame([dict1, dict2], columns=columns)

The output of df looks like:

  gene sample  input
0  ABC    XYZ     23
1  DEF    ERT     24

That looks like what you're looking for in your question. If that's true, you can use a similar setup (like the code block at the beginning) to build this DataFrame.

If you mean you have that format and you're looking to transpose it, I would recommend the following:

    # columns will be the index from 0 to n-1:
    df.transpose()
    # output:
    #           0    1
    # gene    ABC  DEF
    # sample  XYZ  ERT
    # input    23   24

    # try this instead
    list_that_contains_n_items_to_be_columns = ["a", "b"]
    df.index = pd.Index(list_that_contains_n_items_to_be_columns)
    df.transpose()
    # output:
    #           a    b
    # gene    ABC  DEF
    # sample  XYZ  ERT
    # input    23   24

If you meant you have the info you posted in a text file like:

gene : ABC
sample: XYX
input:23
gene : DEF
sample: ERT
input :24

you would need to read it in and put it in a DataFrame (similar to csv format). You could do that by:

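    import pandas as pd

    number_columns = 3  # fields per record; change this as necessary
    list_of_dicts = []
    dict_row = {}
    with open("data.txt") as f:
        for line in f:
            # skip blank lines and filler lines that have no "key: value" pair
            if ":" not in line:
                continue
            # split on the first ":" only, so values may contain colons
            key, val = line.split(":", 1)
            # strip both parts so "gene " and " ABC" become "gene" and "ABC"
            dict_row[key.strip()] = val.strip()
            # a record is complete once it has all of its fields
            if len(dict_row) == number_columns:
                list_of_dicts.append(dict_row)
                dict_row = {}

    # add your columns to that list
    df = pd.DataFrame(list_of_dicts, columns=["gene", "sample", "input"])
    print(df)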

This reads your file line by line and creates a list of dictionaries, which is easy to turn into a pandas DataFrame. If you want an actual csv file, you can run df.to_csv("name_of_file.csv", index=False); index=False leaves out the row index, so the header is just gene, sample and input.
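
If pandas is not otherwise needed, the same idea can be sketched with only the standard library. This is a rough illustration, not a drop-in replacement: the helper names parse_records and records_to_csv are made up here, and it assumes every record contains exactly the fields gene, sample and input, with filler lines (such as the "." lines in the question) containing no colon.

```python
import csv
import io


def parse_records(lines, fields=("gene", "sample", "input")):
    """Group "key: value" lines into one dict per record."""
    records, current = [], {}
    for line in lines:
        # skip blank lines and filler lines without a "key: value" pair
        if ":" not in line:
            continue
        key, value = line.split(":", 1)
        current[key.strip()] = value.strip()
        # a record is complete once every expected field has been seen
        if all(field in current for field in fields):
            records.append(current)
            current = {}
    return records


def records_to_csv(records, fields=("gene", "sample", "input")):
    """Return the records as CSV text with the field names as the header row."""
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer, fieldnames=fields, extrasaction="ignore", lineterminator="\n"
    )
    writer.writeheader()
    writer.writerows(records)
    return buffer.getvalue()


# example input taken from the question, including the "." filler lines
example = """gene : ABC
sample: XYX
input:23
.
.
gene : DEF
sample: ERT
input :24
"""
csv_text = records_to_csv(parse_records(example.splitlines()))
print(csv_text)
```

To write a file instead of a string, the same DictWriter could be given a file opened with open("name_of_file.csv", "w", newline="").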

Hope one of these helps!

EDIT: To loop over all files in a directory, you can wrap the file-reading code in the following loop, opening filename instead of "data.txt":

    import glob
    for filename in glob.glob("/your/path/here/*.txt"):
        # code you want to execute
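
Put together, the directory loop might look like the sketch below. The names parse_file and parse_directory are hypothetical, the "*.txt" pattern is an assumption about how the files are named, and the two temporary files exist only to make the example runnable.

```python
import glob
import os
import tempfile


def parse_file(path, number_columns=3):
    """Parse one file of "key: value" lines into a list of row dicts."""
    rows, row = [], {}
    with open(path) as f:
        for line in f:
            if ":" not in line:
                continue  # blank or filler line
            key, val = line.split(":", 1)
            row[key.strip()] = val.strip()
            # start a new row once this one has all of its fields
            if len(row) == number_columns:
                rows.append(row)
                row = {}
    return rows


def parse_directory(directory, pattern="*.txt"):
    """Parse every matching file in `directory` and pool the rows."""
    all_rows = []
    # sorted() makes the order repeatable; glob returns files in arbitrary order
    for filename in sorted(glob.glob(os.path.join(directory, pattern))):
        all_rows.extend(parse_file(filename))
    return all_rows


# demo on two throwaway files that hold one gene each
with tempfile.TemporaryDirectory() as tmp:
    demo_files = [
        ("a.txt", "gene : ABC\nsample: XYX\ninput:23\n"),
        ("b.txt", "gene : DEF\nsample: ERT\ninput :24\n"),
    ]
    for name, text in demo_files:
        with open(os.path.join(tmp, name), "w") as out:
            out.write(text)
    all_rows = parse_directory(tmp)
print(all_rows)
```

The pooled list can then be passed to pd.DataFrame(all_rows, columns=["gene", "sample", "input"]) as in the single-file version.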

EDIT 2:

The question as posted does not seem to match what is actually being asked (see the comments on this answer). It seems the author has .tsv files that are already in a DataFrame-esque format and wants them read in as DataFrames. The sample file given is:

Sample Name:    1234
Index:  IB04
Input DNA:  100

Detected ITD Variants:
Size    READS   VRF



Sample Name:    1235
Index:  IB05
Input DNA:  100

Detected Variants:
Size    READS   VRF
27  112995  4.44e-01
Total   112995  4.44e-01

Example code to read this file in and create a "Sample" DF:

    #!/usr/bin/python
    import glob

    import pandas as pd


    def get_df(num_cols=3, start_key="Sample"):
        """Collect the "key: value" sample blocks of every .tsv file in the
        current directory into one DataFrame."""
        list_of_dfs = []
        for filepath in glob.glob("*.tsv"):
            list_of_dicts = []
            dict_row = {}
            part_of_df = False
            with open(filepath) as file:
                for line in file:
                    # only read in lines that are part of a sample block
                    if start_key in line:
                        part_of_df = True
                    elif line.strip() == "":
                        # an empty line ends the current block
                        part_of_df = False
                        continue
                    if part_of_df and ":" in line:
                        # split on the first ":" only, so values may contain colons
                        key, val = line.split(":", 1)
                        dict_row[key.strip()] = val.strip()
                        # a row is complete once it has num_cols fields
                        if len(dict_row) == num_cols:
                            list_of_dicts.append(dict_row)
                            dict_row = {}
            df = pd.DataFrame(list_of_dicts, columns=["Sample Name", "Index", "Input DNA"])
            list_of_dfs.append(df)
        # concatenate all the files together, renumbering the rows from 0
        return pd.concat(list_of_dfs, ignore_index=True)


    df_samples = get_df(num_cols=3, start_key="Sample")
    print(df_samples)

This creates a DataFrame with one row per sample (the Sample Name, Index and Input DNA fields). If this is the dataset you are looking for, please mark this answer as accepted. Please ask a new question if you have further questions (posting a data file in the question is very helpful).
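
For the variant rows under each sample, a similar line-by-line pass could collect one row per variant instead of one value per key. This is only a sketch under assumptions read off the sample file above: the table header line starts with Size, columns are whitespace-separated, a blank line ends the table, and the Total row is a summary to skip. parse_variants is a hypothetical helper name and is not part of the code above.

```python
def parse_variants(lines):
    """Return one dict per variant row, tagged with the sample it belongs to."""
    rows = []
    sample_name = None
    header = None  # column names of the table currently being read
    for line in lines:
        stripped = line.strip()
        if stripped == "":
            # a blank line ends the current table
            header = None
        elif stripped.startswith("Sample Name:"):
            sample_name = stripped.split(":", 1)[1].strip()
        elif stripped.startswith("Size"):
            # assumed table header, e.g. "Size READS VRF"
            header = stripped.split()
        elif header is not None:
            values = stripped.split()
            # skip the summary row and anything that is not a full table row
            if values[0] != "Total" and len(values) == len(header):
                row = {"Sample Name": sample_name}
                row.update(zip(header, values))
                rows.append(row)
    return rows


# the sample file from above, inlined so the sketch runs on its own
sample_text = """Sample Name:    1234
Index:  IB04
Input DNA:  100

Detected ITD Variants:
Size    READS   VRF



Sample Name:    1235
Index:  IB05
Input DNA:  100

Detected Variants:
Size    READS   VRF
27  112995  4.44e-01
Total   112995  4.44e-01
"""
rows = parse_variants(sample_text.splitlines())
print(rows)
```

A sample with several sizes would simply yield several rows sharing the same Sample Name, and the list can be handed to pd.DataFrame(rows) if a DataFrame is wanted.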

OrionTheHunter
  • Did my edit help? If each file contains a single gene then you would add one dictionary for each file. You could also concatenate all those text files into one big text file and read that in. – OrionTheHunter Aug 01 '19 at 20:40
  • It did help, but I keep getting the error - Value error - need more than 1 value to unpack – foondar Aug 01 '19 at 21:13
  • It's hard to know what error you're having as I can't see your data files. Where is the error occurring? If your data has more than one ":" per line that would be an issue. – OrionTheHunter Aug 01 '19 at 21:48
  • @foondar, you'll have to post a link to the file or paste the whole file. I can't do much with those ten words. – OrionTheHunter Aug 01 '19 at 22:33
  • @foondar I have added your data file and code that collects the data into a dataframe. If you have further questions, please mark this answer as accepted and create a new question that more accurately describes your problem. – OrionTheHunter Aug 02 '19 at 19:50
  • So this works, but it breaks when there is a blank line. How do I continue iterating over the empty lines in my text file? – foondar Aug 05 '19 at 22:13
  • Is there a way to have multiple values associated with one key in this dictionary, as there are multiple sizes in some files? – foondar Aug 08 '19 at 21:41