0

I am reading a file with one JSON object per line (ndjson)

dfjson = pd.read_json(path_or_buf=JsonFicMain,orient='records',lines=True)

Here is an example of 2 lines of the content of the dataframe (after dropping columns)

              nomCommune  codeCommune numeroComplet                    nomVoie  codePostal                                                                                            meilleurePosition    codesParcelles
0        Ablon-sur-Seine        94001            21        Rue Robert Schumann       94480  {'type': 'parcelle', 'geometry': {'type': 'Point', 'coordinates': [2.411247955172414, 48.726054248275865]}}  [94001000AG0013]
1        Ablon-sur-Seine        94001            13        Rue Robert Schumann       94480   {'type': 'parcelle', 'geometry': {'type': 'Point', 'coordinates': [2.412065866666666, 48.72614911111111]}}  [94001000AG0020]

It contents million of rows, I want to extract one geo coordinate, between square brackets, in a specific colum (named meilleurePosition). The expected output is

[2.411247955172414, 48.726054248275865]

I tried to either extract the coordinate or replace all other unwanted characters Using extractall, or extract does not match

test=dfjson['meilleurePosition'].str.extract(pat='(\d+\.\d+)')
test2=dfjson['meilleurePosition'].str.extractall(pat='(\d+\.\d+)')
Empty DataFrame
Columns: [0]
Index: []

Using replace, or str.replace does not work

test3=dfjson["meilleurePosition"].replace(to_replace=r'[^0-9.,:]',value='',regex=True)
0       {'type': 'parcelle', 'geometry': {'type': 'Point', 'coordinates': [2.411247955172414, 48.726054248275865]}}
1        {'type': 'parcelle', 'geometry': {'type': 'Point', 'coordinates': [2.412065866666666, 48.72614911111111]}}

Even none regex type does not work

test4=dfjson['meilleurePosition'].str.replace('type','whatever')
0      NaN
1      NaN

print(test)

I have tried to find why this does not work at all.

  • Column type is 'object' (which is apparently good as this is a string)
  • Using inplace=True without copying the dataframe leads to similar results

Why can't I manipulate this column, is it because of the special characters in it? How can get these coordinate in the good format?

OK, after more investigation, the column contains a nested dict, that's why it is not working This answer helped me a lot python pandas use map with regular expressions I did then use the following code to create a new column with the expected coordinates

def extract_coord(meilleurepositiondict):
    if isinstance(meilleurepositiondict,dict) :
        return meilleurepositiondict['geometry']['coordinates']
    else :
        return None

dfjson['meilleurePositionclean']=dfjson['meilleurePosition'].apply(lambda x: extract_coord(x))
Dharman
  • 30,962
  • 25
  • 85
  • 135
Myrt
  • 34
  • 4
  • Thanks to [simple function def to use with apply](https://stackoverflow.com/questions/32189964/pandas-make-function-map-partial-dict-match) [understanding if i had to use loop or apply](https://stackoverflow.com/questions/54028199/are-for-loops-in-pandas-really-bad-when-should-i-care/54028200#54028200) – Myrt Aug 24 '20 at 08:37

1 Answers1

0

I found the solution using the code below

dfjson['meilleurePosition']=dfjson['meilleurePosition'].apply(lambda x: extract_coord(x) if x == x else defaultmeilleurepositionvalue)

this was required because of empty rows leading to error (not trapped in function definition). However ,i am still convinced there is much easy way to assign a dict value of a column to the column itself , still trying...

Myrt
  • 34
  • 4