
I am working on a script that uses the pandas library. I am new to pandas, so this question may be silly. I've imported my data from a CSV into a pandas DataFrame, which looks like this:

       set1      set2      set3      set4
0  744110.0  507121.0  790001.0  785693.0
1  744107.0  507126.0  791002.0  788107.0
2  744208.0  535214.0  791103.0  788108.0
3  744210.0  534195.0  790116.0  784170.0

I am facing 2 problems:

Problem 1

The values in the CSV are integers; I don't know why or how that .0 is popping up, and I don't want it to happen.

I create my DataFrame with the following line:

df = pd.read_csv(file_path)

Problem 2

I want to search through the sets and get the name of the set that contains a given value. For example, if I pass in the value 791103, the output should be the name set3 as a string.

How can I achieve this in pandas?

Please note: different columns may have different numbers of items; for example, set1 may have 500 total values while set2 may have just 40.

.to_dict('list') output:

{'set1': [744110.0, 744107.0, 744208.0, 744210.0], 'set2': [507121.0, 507126.0, 535214.0, 534195.0], 'set3': [790001.0, 791002.0, 791103.0, 790116.0], 'set4': [785693.0, 788107.0, 788108.0, 788170.0]}

  • For your first question, `pd.read_csv(filename, dtype = 'int')`. – Evan Jan 31 '18 at 19:30
  • One reason why you get the `.0` is because in your `df` there are NaNs. For problem 2 please share your `df.to_dict()` so we don't have to typeset your `df`. – rpanai Jan 31 '18 at 19:36
  • @user32185 I'll just post it, gimme a moment – iam.Carrot Jan 31 '18 at 19:37
  • 1
    For your second, this might work: `df.iloc[np.where(df == 791103)]`. Obv, the output might req. some formatting. – Evan Jan 31 '18 at 19:42

2 Answers

import numpy as np
import pandas as pd

"""  set1  set2  set3  set4  
0  744110.0  507121.0  790001.0  785693.0
1  744107.0  507126.0  791002.0  788107.0
2  744208.0  535214.0  791103.0  788108.0
3  744210.0  534195.0  790116.0  784170.0
"""
df = pd.read_clipboard(sep='\s{2,}', engine='python', dtype = 'int')
df

For your first problem, you can set the data types upon import. As mentioned by @user32185, NaNs can cause issues when trying to cast as int.

pd.read_csv(filename, dtype='int')
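
If the file has ragged columns like the ones described in the question, the shorter columns get padded with NaN, and a plain int dtype cannot hold NaN; that is what forces the float upcast and the trailing .0. A minimal sketch of a workaround, assuming pandas 0.24 or later with its nullable Int64 dtype (the inline CSV here is just a stand-in for the real file):

import io
import pandas as pd

# set2 is one value shorter than set1, so pandas fills the gap with a
# missing value; the nullable 'Int64' dtype keeps integers alongside <NA>
csv_data = "set1,set2\n744110,507121\n744107,\n"
demo = pd.read_csv(io.StringIO(csv_data), dtype='Int64')
print(demo.dtypes)  # set1    Int64
                    # set2    Int64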

For your second, I tried a couple of things, but this worked best:

import numpy as np
df.iloc[np.where(df == 791103)]

Output:

    set3
2   791103

To get just the column name:

df.iloc[np.where(df == 791103)].columns[0]

Output:

'set3'
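
If you need the lookup more than once, you can wrap it in a small helper (`find_set` is just an illustrative name, not a pandas function); this sketch returns None instead of raising an IndexError when the value is in none of the columns:

import numpy as np

def find_set(df, value):
    # np.where gives the (row, column) positions of every matching cell;
    # an empty selection means the value was not found
    hit = df.iloc[np.where(df == value)]
    return hit.columns[0] if len(hit.columns) else None

find_set(df, 791103)  # 'set3'
find_set(df, -1)      # None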

Links:

Convert Pandas column containing NaNs to dtype `int`

https://chrisalbon.com/python/data_wrangling/pandas_create_column_using_conditional/

Evan
  • Thanks for the answer. The question then is: what should I do if I have `nan`? Should I use `str` and do a string comparison instead? – iam.Carrot Jan 31 '18 at 19:49
  • Not sure if that'll work. I'm looking at `df.where`, but it's returning the whole (demo) df. Will update if I find something better than what I posted. – Evan Jan 31 '18 at 20:01

Another option to get the columns with the given value is:

import numpy as np

value = 791103
mask = (df.values == value).any(axis=0)                # True for each column that contains the value
cols = [df.columns[idx] for idx in np.where(mask)[0]]  # names of the matching columns

On my machine this takes 15.9 µs ± 645 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each), while Evan's answer takes 628 µs ± 2.01 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each).
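
For completeness, a short usage sketch (not part of the original timing) showing how to get the single string the question asked for out of `cols`:

print(cols)                       # ['set3']
name = cols[0] if cols else None  # first matching column name, or None if absent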

rpanai
  • I don't understand why you are using a list comprehension; I thought of using pandas to avoid them. Also, it returns a collection while I need the string (not that getting a string is a huge deal), but I am not quite able to understand how it works faster. Would you be able to explain it? – iam.Carrot Jan 31 '18 at 20:28
  • Pandas is not fast... this is a really nice solution. – Evan Jan 31 '18 at 20:31
  • @iam.Carrot Usually adding `.values` speeds things up. See [docs](https://pandas.pydata.org/pandas-docs/stable/enhancingperf.html). I'm using a list comprehension for two main reasons: 1) I didn't find anything else that could do the trick; 2) if you have more than one column with the given value, you get a list of all of them, so why not? – rpanai Jan 31 '18 at 20:38
  • Great, thanks. I've marked it as the answer since it's way more efficient. – iam.Carrot Jan 31 '18 at 20:48