Here is my DataFrame:

In [106]: ogl.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000163 entries, 0 to 1000162
Data columns (total 5 columns):
 #   Column                       Non-Null Count    Dtype
---  ------                       --------------    -----
 0   geolocation_zip_code_prefix  1000163 non-null  int64
 1   geolocation_lat              1000163 non-null  float64
 2   geolocation_lng              1000163 non-null  float64
 3   geolocation_city             1000163 non-null  object
 4   geolocation_state            1000163 non-null  object
dtypes: float64(2), int64(1), object(2)
memory usage: 38.2+ MB

It comes from the Brazilian E-Commerce Public Dataset by Olist, olist_geolocation_dataset.csv. Oddly enough, geolocation_city and geolocation_state are not redundant given geolocation_zip_code_prefix. For example, compare row 49285:

    "03203",-23.598384873160597,-46.56677381072186,sao paulo,SP

with row 51000:

    "03203",-23.216648333054426,-46.86137071772756,jundiaí,SP

I was curious to know how well (geolocation_lat, geolocation_lng) could predict (geolocation_state, geolocation_city, geolocation_zip_code_prefix). Each combination of these 3 fields can be thought of as a category, such as (03203, sao paulo, SP), that contains a list of (geolocation_lat, geolocation_lng) pairs, such as [(-23.598384873160597,-46.56677381072186), ...].

I thought this could be achieved with one-way ANOVA, but now I am beginning to doubt this. How would I measure the strength of association, in the way Cramér's V does for two categorical variables, but for predicting categories from quantitative data (geolocations)?
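To make the question concrete: one candidate measure I am aware of is the correlation ratio (eta squared), the share of a numeric variable's total variance that lies between categories. It is the effect size usually reported alongside one-way ANOVA. Below is a minimal sketch, not a settled answer. The helper name `correlation_ratio` and the toy frame are my own assumptions; only the first and third toy rows come from the dataset, the rest are invented neighbours. It also treats latitude and longitude separately, which may not be the right way to handle a two-dimensional coordinate.

```python
import pandas as pd

GROUP_COLS = [
    "geolocation_state",
    "geolocation_city",
    "geolocation_zip_code_prefix",
]


def correlation_ratio(df, group_cols, value_col):
    """Eta squared: between-group sum of squares / total sum of squares.

    Hypothetical helper, not part of pandas. Returns a value in [0, 1];
    1 means the category fully determines the value.
    """
    values = df[value_col].astype(float)
    grand_mean = values.mean()
    grouped = df.groupby(group_cols)[value_col]
    # Weight each group's squared deviation from the grand mean by its size.
    ss_between = (grouped.size() * (grouped.mean() - grand_mean) ** 2).sum()
    ss_total = ((values - grand_mean) ** 2).sum()
    return float(ss_between / ss_total) if ss_total > 0 else float("nan")


# Toy data: rows 0 and 2 are quoted from the dataset, the others are made up.
toy = pd.DataFrame(
    {
        "geolocation_zip_code_prefix": [3203, 3203, 3203, 3203, 1037, 1037],
        "geolocation_lat": [-23.598384873160597, -23.60,
                            -23.216648333054426, -23.22, -23.545, -23.546],
        "geolocation_lng": [-46.56677381072186, -46.57,
                            -46.86137071772756, -46.86, -46.639, -46.638],
        "geolocation_city": ["sao paulo", "sao paulo", "jundiaí", "jundiaí",
                             "sao paulo", "sao paulo"],
        "geolocation_state": ["SP"] * 6,
    }
)

eta2_lat = correlation_ratio(toy, GROUP_COLS, "geolocation_lat")
eta2_lng = correlation_ratio(toy, GROUP_COLS, "geolocation_lng")
print(f"eta^2 lat: {eta2_lat:.4f}, eta^2 lng: {eta2_lng:.4f}")
```

On the real data this would be called as `correlation_ratio(ogl, GROUP_COLS, "geolocation_lat")`. Taking the square root gives eta, which is on a scale closer to Cramér's V. Whether a per-coordinate measure like this is adequate, or whether a multivariate analogue is needed, is part of what I am asking.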

user2309803
  • Can you actually provide example output? And example input? This *really* helps. Like, what is a "comma separated list"? Lists *don't have commas*. It is much clearer to give unambiguous examples, which is what a programming language is designed to do. See [this question for inspiration](https://stackoverflow.com/questions/20109391/how-to-make-good-reproducible-pandas-examples) – juanpa.arrivillaga Jan 18 '21 at 17:42
  • Can you explain the idea of passing lists of geo-coordinates into an ANOVA test? In ANOVA you usually want to compare measurements from various groups. So your groups are zip codes here (because city and state should reduce to zip_code_prefix), and you want to compare their coordinates using ANOVA? I don't understand it here. – Michał89 Jan 18 '21 at 17:46
  • @juanpa.arrivillaga, @Michał89 I think my assumption that one-way ANOVA could measure the strength of association between `(geolocation_lat, geolocation_lng)` and `(geolocation_state, geolocation_city, geolocation_zip_code_prefix)` was wrong so I have tried to formulate a better question, please see above. – user2309803 Jan 18 '21 at 21:14
  • Show what the results should look like. Have you tried `df.to_dict()`? – Golden Lion Jan 27 '21 at 17:23

0 Answers