3

I have a defined range:

df = pd.DataFrame([["1", "10"], ["11", "67"], ["90", "115"]], columns=['start', 'end'])

And a list of strings:

df2 = pd.DataFrame([["1"], ["3"], ["31"], ["70"], ["71"], ["90"], ["99"], ["100"], ["200"]], columns=['reference'])

And I try to get a result that looks as follows:

df3 = pd.DataFrame([["1", "1", "10"], ["3", "1", "10"], ["31", "11", "67"], ["70", "no range", "no range"], 
                ["71", "no range", "no range"], ["90", "90", "115"], ["99", "90", "115"], 
                ["100", "90", "115"], ["200", "no range", "no range"]], columns=['reference', "start", "end"])

I tried to do something similar earlier on, but with using numpy only. The solution then looked like this:

result_good=[]
result_bad=[]
for d in extension:
    categories = np.logical_and(d >= ranges[:,1], d <= ranges[:,2])
    if (ranges[:,0][categories]):
        result_good.append(ranges)
    else:
        result_bad.append(d)

This basically worked. I want to get this to work with Pandas though. But all I get to work is to compare two dataframes of the same length or to do it "brute force" with a loop. There must be a more elegant way to do that. Thank you for your help.

SLglider
  • 267
  • 1
  • 4
  • 16

3 Answers3

1
ranges = pd.DataFrame([["1", "10"], ["11", "67"], ["90", "115"]], columns=['start', 'end']).astype(int)
items = pd.DataFrame([["1"], ["3"], ["31"], ["70"], ["71"], ["90"], ["99"], ["100"], ["200"]], columns=['reference']).astype(int)

Make a DataFrame with results

result_df = pd.DataFrame(index = items.index, columns= ['reference', 'range_start', 'range_end'],
    data={'reference': items['reference'], 'range_start': None, 'range_end': None},)

Fill the result_df

for num_range in ranges.itertuples():
    matched_items = items['reference'].between(num_range.start, num_range.end)
    result_df.loc[matched_items, ['range_start', 'range_end']] = (num_range.start, num_range.end)

result_df

   reference  range_start  range_end
0  1          1            10
1  3          1            10
2  31         11           67
3  70         None         None
4  71         None         None
5  90         90           115
6  99         90           115
7  100        90           115
8  200        None         None

Formatting

You can then do a .fillna() to replace the None with 'no range'

Maarten Fabré
  • 6,938
  • 1
  • 17
  • 36
  • Hi Maarten, thank you for your solution and the detailed description. I have been trying Allen's and your solution. With both, I run into issues with the length of the str that need to be converted into int. The numbers can have up to 15 digits. Apparently, this is an issue for Python on my machine... – SLglider Jun 01 '17 at 14:03
  • I do not know, what exact version I was using back then. I am on Anaconda Python 3.6 now and working on different machines (Windows / Mac). But the issue I had came from the source data I was reading in. What I think is that some NaN values messed up the following procedures. But, I am just a programming hobbyist - so this is just me guessing. – SLglider Jul 13 '17 at 18:42
  • 1
    Numpy (and pandas) don't have support for NaN in int arrays, so it gets converted to float, which resulted in the loss of precision. Better to keep identifiers as strings – Maarten Fabré Jul 13 '17 at 19:01
0
#create a dict to store reference number and its start and end range.
d = {'default':{'start':'no range','end':'no range'}}
#populate the range dict
df.apply(lambda x: [d.update({str(k):{'start':x.start, 'end':x.end} for k in range(int(x.start),int(x.end)+1)})],axis=1)
#get the range start and end for each reference point and merge with df2.
df2.apply(lambda x: pd.Series(d.get(x.reference,d.get('default'))),axis=1).join(df2)[['reference','start','end']]
Out[240]: 
  reference     start       end
0         1         1        10
1         3         1        10
2        31        11        67
3        70  no range  no range
4        71  no range  no range
5        90        90       115
6        99        90       115
7       100        90       115
8       200  no range  no range
Allen Qin
  • 19,507
  • 8
  • 51
  • 67
  • I am trying to use this solution, as it does not do an iteration. The problem I am getting though is, that my real problem deals with much longer strings (up to fifteen digits). When I am now trying to convert str to int, I am getting an overflow error. Any idea, what I could do? Or maybe, I am simply misinterpreting the error message? – SLglider May 31 '17 at 15:30
  • Is the 15digits string in reference or start and end columns? Are all the strings valid int in start and end columns? – Allen Qin May 31 '17 at 22:18
  • The problem came with the reference items. Anyway, all items (reference and ranges) are str. So, in "real life", I am dealing with phone numbers that can get up to 15 digits (or characters in this case). A sys.maxsize gives me: 9223372036854775807. Should this not be sufficient? – SLglider Jun 01 '17 at 07:03
  • I've updated the solution. If the large numbers only appear in the reference column, this should fix your problem. – Allen Qin Jun 01 '17 at 07:08
  • Hi Allen, thank you for the update. Sure enough, I do have the same issue with the start / end column: ValueError: ("invalid literal for int() with base 10: '4,9302E+11'", 'occurred at index 84') – SLglider Jun 01 '17 at 13:56
  • '4,9302E+11' is not even a valid number. How would you like to handle strings like this? – Allen Qin Jun 01 '17 at 19:48
  • I am reading this in from a CSV. There it still is a "normal" number. But trying to read it causes me problems. Just reading it transfers it to str. Trying to import it as int / changing it to int, results in the before mentioned issue. – SLglider Jun 02 '17 at 09:02
  • `'4,9302E+11' is not even a valid number.` it is with a decimal comma. About the conversion. If this is an identifier (like a telephone number) best keep it as string. (This applies if you work with Excel too). For the operation itself you can transfer it to `int` (or `uint`) but storing in the DataFram can be best done as `str`. Otherwise you lose the risk that when a `null` value is inserted in the column, the `int` gets converted to `float` and you loose precision – Maarten Fabré Jun 06 '17 at 09:06
  • I finally figured, why I had all sorts of problems to get my data as int. Your script was perfectly fine, but my source data had issues. Basically, there were two NULL entries in the "end" column. Python automatically converted them into floats. Another lesson learned. – SLglider Jun 06 '17 at 12:30
  • @MaartenFabré: I was just re-visiting this post today and just now really see, what you are saying. And you are so right, it makes it so much easier to treat phone numbers as strings. – SLglider Jul 13 '17 at 18:18
0

Inspired by this question. But this way can be used:

ranges = pd.DataFrame([["1", "10"], ["11", "67"], ["90", "115"]], columns=['start', 'end']).astype(int)
items = pd.DataFrame([["1"], ["3"], ["31"], ["70"], ["71"], ["90"], ["99"], ["100"], ["200"]], columns=['reference']).astype(int)

a = items.reference.values
bh = ranges.end.values
bl = ranges.start.values

i, j = np.where((a[:, None] >= bl) & (a[:, None] <= bh))

df = pd.DataFrame(
    np.column_stack([items.values[i], ranges.values[j]]),
    columns=items.columns.append(ranges.columns)
)

# if you need not elements not in range

items.merge(df, on='reference', how='left')
Andrey Kolpakov
  • 446
  • 4
  • 7