3

I am trying to put specific columns of tab separated files into a dictionary. I am trying several things and none give me the result I am looking for.

I have for example this file:

Name   Start   End   Size
del1   100     105   5
del2   150     160   10
del3   250     300   50

and this file, both .csv

Name   Qual   StartB  EndB  Size
inv1   6      400     405   5
inv2   7      450     460   10
inv3   20     450     400   50

What I want is something like this, where Name is the key and the others are values, additionally I have the problem of changing headers and indexes of headers, but they mean the same thing:

del_dict{del1: {Start: 100, End: 105, Size:5} del2: {etc}

I tried reading the file in several ways, based on other stack overflow answers.

for file in glob.glob(directoryname + "/*.csv"):
    dict = pd.read_csv(file, squeeze=False, sep="\t").to_dict()
        print(dict)

and

for file in glob.glob(directoryname + "/*.csv"):
    df = pd.read_csv(open(file, 'r'), header=0, sep="\t")
    if "StartB" in df.keys():
        name = df.Name
        start_pos = df.StartB
        end_pos = df.EndB
    else:
        name = df.Name
        start_pos = df.Start
        end_pos = df.End

But this gives me dataframes, that I cannot seem to fit in that into a dictionary.

I also tried this code, which I used before, but then it was only one file and no changing headers and then it will result in too many loops and hard coding to digest everything I need, based on the file I open.

for file in glob.glob(directoryname + "/*.csv"):
    with open(file, 'r') as csvfile:
        csv_list = []
        for line in csvfile:
            csv_list.append(line.strip("\t"))

I am fairly new to python, and I know a relatively simply answer must be available, but I cannot seem to find it. Sorry if the answer is already on stack overflow, I tried for hours to find a similar/workable problem and this is the point I am really getting stuck.

Fini
  • 163
  • 10

2 Answers2

1

I think need create index by column Name with DataFrame.set_index and then call DataFrame.to_dict with parameter orient='index':

df = pd.read_csv(file,  sep="\t")

d = df.set_index('Name').to_dict(orient='index')
print (d)
{'del1': {'Start': 100, 'End': 105, 'Size': 5},
 'del2': {'Start': 150, 'End': 160, 'Size': 10},
 'del3': {'Start': 250, 'End': 300, 'Size': 50}}

EDIT - You can rename columns names by dictionary and select by list columns for export to dict:

d = {'StartB':'Start','EndB':'End'}
d = df.set_index('Name').rename(columns=d)[['Start','End','Size']].to_dict(orient='index')
print (d)
{'inv1': {'Start': 400, 'End': 405, 'Size': 5}, 
 'inv2': {'Start': 450, 'End': 460, 'Size': 10}, 
 'inv3': {'Start': 450, 'End': 400, 'Size': 50}}
jezrael
  • 822,522
  • 95
  • 1,334
  • 1,252
  • Thanks for the quick reply. It seems to work out!. But now I have the whole file in the dictionary (which is much bigger than the example I gave above). In the simplyfied example above, I dont need the "Qual" column and the "StartB" is the same as "Start" (only Start and End have this problem). Could you help me with this problem as well? – Fini Oct 01 '18 at 11:58
  • 1
    @Fini - Solution is use dictionary for rename, check edited answer. Also if there is some pattern - e.g. each file haveifferent only last letter, dict should be created dynamically. Let me know how working solution. – jezrael Oct 01 '18 at 12:05
  • Thanks! I was already trying to rename the dictionarys, but this solution works much better. It is the desired result right now, but of course I need to implement this in my actual file and get more column out of my files, but that should be no problem. The only problem I encountered right now is that one file has an extra column I need named "Duplications", but adding that gives of course an error that Duplications is not in the index (in certain files). Because that was not the scope of the question, I will mark this as anwer. Thanks! – Fini Oct 01 '18 at 12:12
  • @Fini - You are welcome! If want removve some column which is not in each DataFrame, use [this](https://stackoverflow.com/a/36958937) solution. – jezrael Oct 01 '18 at 12:14
  • It was actually the other way around. To put the problem in this example, i DO need the column "Qual". But this column is only in file 2 and not in file 1, therefore you get a keyerror. But I will try to find the answer somewhere. I guess I have to use an if statement like: if "Qual" not in df.keys(): your code, else: code with added "Qual" – Fini Oct 01 '18 at 12:21
0

Based on the help and expertise of @jezrael (marked as answer), I put my final code here, where I also merged the dictionaries in the format that I needed and found a solution to the column that do not exist in every file.

Please let me know if this is not the way to go on stack overflow.

    csv_dict = {}
    for file in glob.glob(directoryname + "/*.csv"):
        df = pd.read_csv(file, sep="\t")
        d = {'StartB': 'Start', 'EndB': 'End'}
        if "Duplications" in df.keys():
            d = df.set_index('Name').rename(columns=d)[['Start', 'End', 'Size', 'Duplications']].to_dict(orient='index')
            csv_dict.update(d)
        else:
            d = df.set_index('Name').rename(columns=d)[['Start', 'End', 'Size']].to_dict(orient='index')
            csv_dict.update(d)
    print(csv_dict)

result:

{'del1': {'Start': 969261, 'End': 969270, 'Size': 10}, 
'del2': {'Start': 641573, 'End': 641672, 'Size': 100}, 
'del3': {'Start': 998620, 'End': 999119, 'Size': 500}, 
'dup1': {'Start': 595662, 'End': 595671, 'Size': 10, 'Duplications': 3}, 
'dup2': {'Start': 321225, 'End': 321324, 'Size': 100, 'Duplications': 3}, 
'dup3': {'Start': 971634, 'End': 972133, 'Size': 500, 'Duplications': 10}, 
'inv1': {'Start': 818450, 'End': 818459, 'Size': 10}, 
'inv2': {'Start': 991098, 'End': 991197, 'Size': 100},
'inv3': {'Start': 219635, 'End': 220134, 'Size': 500}}
Fini
  • 163
  • 10