
Python 2.7

I have a DataFrame with two columns, `coordinates` and `locs`. `coordinates` contains 10 lat/long pairs and `locs` contains 10 strings.

The following code leads to a `ValueError: Arrays were different lengths`. It seems like I'm not writing the condition correctly.

import pandas as pd

lst_10_cords = [['37.09024, -95.712891'], ['-37.605, 145.146'], ['43.0481962, -76.0488458'], ['29.7604267, -95.3698028'], ['47.6062095, -122.3320708'], ['34.0232431, -84.3615555'], ['31.9685988, -99.9018131'], ['37.226582, -95.70522299999999'], ['40.289918, -83.036372'], ['37.226582, -95.70522299999999']]
lst_10_locs = [['United States'], ['Doreen, Melbourne'], ['Upstate NY'], ['Houston, TX'], ['Seattle, WA'], ['Roswell, GA'], ['Texas'], ['null'], ['??, passing by...'], ['null']]
df = pd.DataFrame(columns=['coordinates', 'locs'])
df['coordinates'] = lst_10_cords
df['locs'] = lst_10_locs
print df
df = df[df['coordinates'] != ['37.226582', '-95.70522299999999']]  # ValueError

The error message is:

File "C:\Users...\Miniconda3\envs\py2.7\lib\site-packages\pandas\core\ops.py", line 1283, in wrapper
  res = na_op(values, other)
File "C:\Users...\Miniconda3\envs\py2.7\lib\site-packages\pandas\core\ops.py", line 1143, in na_op
  result = _comp_method_OBJECT_ARRAY(op, x, y)
File "C:...\biney\Miniconda3\envs\py2.7\lib\site-packages\pandas\core\ops.py", line 1120, in _comp_method_OBJECT_ARRAY
  result = libops.vec_compare(x, y, op)
File "pandas/_libs/ops.pyx", line 128, in pandas._libs.ops.vec_compare
ValueError: Arrays were different lengths: 10 vs 2

My goal here is to check for and eliminate all entries in the coordinates column that are equal to the list [37.226582, -95.70522299999999], so I want df['coordinates'] to print out [['37.09024, -95.712891'], ['-37.605, 145.146'], ['43.0481962, -76.0488458'], ['29.7604267, -95.3698028'], ['47.6062095, -122.3320708'], ['34.0232431, -84.3615555'], ['31.9685988, -99.9018131'], ['40.289918, -83.036372']]

I was hoping that this documentation would help, particularly the part that shows: "You may select rows from a DataFrame using a boolean vector the same length as the DataFrame's index (for example, something derived from one of the columns of the DataFrame):" `df[df['A'] > 0]`

So it seems like I'm not quite getting the syntax right, and I'm stuck. How do I set a condition on the cell values of a certain column and return a DataFrame containing only the rows that meet that condition?

Byron Smith

3 Answers


Can you consider this?

df
    coordinates                 locs
0   [37.09024, -95.712891]      [United States]
1   [-37.605, 145.146]          [Doreen, Melbourne]
2   [43.0481962, -76.0488458]   [Upstate NY]
3   [29.7604267, -95.3698028]   [Houston, TX]
4   [47.6062095, -122.3320708]  [Seattle, WA]
5   [34.0232431, -84.3615555]   [Roswell, GA]
6   [31.9685988, -99.9018131]   [Texas]
7   [37.226582, -95.705222999]  [null]
8   [40.289918, -83.036372]     [??, passing by...]
9   [37.226582, -95.7052229999] [null]


import numpy as np

# split the one-element "lat, lon" string on the comma and cast each half to float
df['lat'] = df['coordinates'].map(lambda x: np.float(x[0].split(",")[0]))
df['lon'] = df['coordinates'].map(lambda x: np.float(x[0].split(",")[1]))
# keep only the rows whose lat/lon are NOT both close to the target pair
df[~((np.isclose(df['lat'],37.226582)) & (np.isclose(df['lon'],-95.70522299999999)))]


    coordinates                 locs                 lat        lon
0   [37.09024, -95.712891]      [United States]      37.090240  -95.712891
1   [-37.605, 145.146]          [Doreen, Melbourne] -37.605000  145.146000
2   [43.0481962, -76.0488458]   [Upstate NY]         43.048196  -76.048846
3   [29.7604267, -95.3698028]   [Houston, TX]        29.760427  -95.369803
4   [47.6062095, -122.3320708]  [Seattle, WA]        47.606209  -122.332071
5   [34.0232431, -84.3615555]   [Roswell, GA]        34.023243  -84.361555
6   [31.9685988, -99.9018131]   [Texas]              31.968599  -99.901813
8   [40.289918, -83.036372]     [??, passing by...]  40.289918  -83.036372
Dickster
  • ah, so the day has finally come... I'm looking it up right now but I don't quite understand lambda functions (nor the map function). Could you try to explain in words what you're doing on the line `df['lat'] = ...`? Actually, it appears as though this isn't entirely working in my case because some of the entries in df['coordinates'] are just strings that say 'null'. So if you could help explain the lambda function here, that could help me modify the code – Byron Smith Jun 27 '18 at 13:22
  • Seems like `Series.apply(func_obj)` can be used to create more complex functions relative to lambda expressions: https://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.apply.html#pandas.Series.apply – Byron Smith Jun 27 '18 at 13:52
  • So I used a smaller version of my original dataset for this question. But it appears that I have some entries in `lst_of_10_coordinates` (well, when I grab more than 10 coordinates) that are only `["null"]`. The single lambda expression doesn't work there; it gives an error upon encountering the `["null"]` list. Right now I'm just making a less elegant for loop that could help me successfully filter rows, but if you have any knowledge of additional pandas commands that could make this task easier... (should I make a separate question?) – Byron Smith Jun 27 '18 at 15:09
  • Let's see if we can do it here. We need a scrubbing method to map across all your entries. Let's assume all are of type list with a single string entry. I'll reply later – Dickster Jun 27 '18 at 17:46

One issue: if you look into the objects your DataFrame is storing, you'll see that each coordinate entry is a single string. The error you are getting arises because pandas is comparing the 10-element `coordinates` Series with a 2-element list, and there is obviously a length mismatch. Using `.values` seemed to get around that.

# rebuild the frame, replacing matching coordinate rows with NaNs, then drop them
df2 = pd.DataFrame([row if row[0] != ['37.226582, -95.70522299999999'] else [np.nan, np.nan] for row in df.values], columns=['coords', 'locs']).dropna()

user85779

OK, here is an approach to ensure you have clean data to operate on.

Let's assume 4 entries, including one dirty coordinate entry.

lst_4_cords = [['37.09024, -95.712891'], ['-37.605, 145.146'], ['43.0481962, -76.0488458'], ['null']]
lst_4_locs = [['United States'], ['Doreen, Melbourne'], ['Upstate NY'], ['Houston, TX']]
df = pd.DataFrame(columns=['coordinates', 'locs'])
df['coordinates'] = lst_4_cords
df['locs'] = lst_4_locs


    coordinates                     locs
0   [37.09024, -95.712891]      [United States]
1   [-37.605, 145.146]          [Doreen, Melbourne]
2   [43.0481962, -76.0488458]   [Upstate NY]
3   [null]                      [Houston, TX]

Now we make a cleaning method. You would really want to test the values using checks like these (a sketch of such a validator follows the list):

type(value) is list
type(value[0]) is str
value[0].split(",") has two elements
each element can be cast to float, etc.
each is valid as a lat or a lon
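
A minimal sketch of that stricter validation, assuming Python 2.7 as in the question (`is_valid_coordinate` is a hypothetical name, not part of this answer):

def is_valid_coordinate(value):
    # value must be a one-element list containing a string
    if not (isinstance(value, list) and len(value) == 1 and isinstance(value[0], str)):
        return False
    parts = value[0].split(",")
    # the string must split on a comma into exactly two pieces
    if len(parts) != 2:
        return False
    try:
        lat, lon = float(parts[0]), float(parts[1])
    except ValueError:
        return False
    # each piece must lie in the valid latitude/longitude range
    return -90.0 <= lat <= 90.0 and -180.0 <= lon <= 180.0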

However, we will do it the quick-and-dirty way using a try/except.

def scrubber_drainer(value):
    try:
        # we assume value is a list with a single string at position zero,
        # and that the string splits on a comma into two float-castable parts
        parts = value[0].split(",")
        return (float(parts[0]), float(parts[1]))
    except:
        # return (38.9072, 77.0396) # swamp
        return (0.0, 0.0) # some default

So the return is typically a tuple of two floats. If the value can't become that, we return a default of (0.0, 0.0).
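
For example, the two branches behave like this (a quick check, not from the original answer):

scrubber_drainer(['37.09024, -95.712891'])  # (37.09024, -95.712891)
scrubber_drainer(['null'])                  # (0.0, 0.0)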

Now update the coordinates:

df['coordinates'] = df['coordinates'].map(scrubber_drainer)

Then we use this cool technique to split the tuple out into separate columns:

df[['lat', 'lon']] = df['coordinates'].apply(pd.Series)

And now you can use np.isclose() to filter:
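
A minimal sketch of that last step, assuming numpy is imported as np and reusing the target pair from the question (`bad_lat` and `bad_lon` are just illustrative names):

# drop every row whose lat/lon both match the unwanted pair
bad_lat, bad_lon = 37.226582, -95.70522299999999
df = df[~(np.isclose(df['lat'], bad_lat) & np.isclose(df['lon'], bad_lon))]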

Dickster