3

I have 2 txt files I'd like to read into python: 1) A map file, 2) A data file. I'd like to have a lookup table or dictionary read the values from TWO COLUMNS of one, and determine which value to put in the 3rd column using something like the pandas.map function. The real map file is ~700,000 lines, and the real data file is ~10 million lines.

Toy Dataframe (or I could recreate as a dictionary) - Map

Chr     Position   Name
1       1000       SNPA
1       2000       SNPB
2       1000       SNPC
2       2000       SNPD

Toy Dataframe - Data File

Chr     Position
1       1000
1       2000
2       1000
2       2001

Resulting final table:

Chr     Position   Name
1       1000       SNPA
1       2000       SNPB
2       1000       SNPC
2       2001       NaN

I found several questions about this with only one column lookup: Adding a new pandas column with mapped value from a dictionary. But can't seem to find a way to use 2 columns. I'm also open to other packages that may handle genomic data.

As a bonus second question, it'd also be nice if there was a way to map the 3rd column if it was with a certain amount of the mapped value. In other words, row 4 of the resulting table above would map to SNPD, as it's only 1 away. But I'd be happy to just get the solution for above.

Community
  • 1
  • 1
Gaius Augustus
  • 940
  • 2
  • 15
  • 37

1 Answers1

3

i would do it this way:

read your map data so that first two columns will become an index:

dfm = pd.read_csv('/path/to/map.csv', delim_whitespace=True, index_col=[0,1])

change delim_whitespace=True to sep=',' if you have , as a delimiter

read up your DF (setting the same index):

df = pd.read_csv('/path/to/data.csv', delim_whitespace=True, index_col=[0,1])

join your DFs:

df.join(dfm)

Output:

In [147]: df.join(dfm)
Out[147]:
              Name
Chr Position
1   1000      SNPA
    2000      SNPB
2   1000      SNPC
    2001       NaN

PS for the bonus question try something like this

Community
  • 1
  • 1
MaxU - stand with Ukraine
  • 205,989
  • 36
  • 386
  • 419
  • How can I do this if my dfs have different column names? I'm just getting started in Python & pandas (I usually work in R). Do I rename the column, or is there another way? – Gaius Augustus May 03 '16 at 00:01
  • @GaiusAugustus, there are three functions in Pandas that might be used for joining: `join`, `merge` and `concat` - read [docs](http://pandas.pydata.org/pandas-docs/stable/merging.html) about them it'll help you to have a good start – MaxU - stand with Ukraine May 10 '16 at 18:03