I am reading a file with one JSON object per line (ndjson)
dfjson = pd.read_json(path_or_buf=JsonFicMain,orient='records',lines=True)
Here is an example of 2 lines of the content of the dataframe (after dropping columns)
nomCommune codeCommune numeroComplet nomVoie codePostal meilleurePosition codesParcelles
0 Ablon-sur-Seine 94001 21 Rue Robert Schumann 94480 {'type': 'parcelle', 'geometry': {'type': 'Point', 'coordinates': [2.411247955172414, 48.726054248275865]}} [94001000AG0013]
1 Ablon-sur-Seine 94001 13 Rue Robert Schumann 94480 {'type': 'parcelle', 'geometry': {'type': 'Point', 'coordinates': [2.412065866666666, 48.72614911111111]}} [94001000AG0020]
It contents million of rows, I want to extract one geo coordinate, between square brackets, in a specific colum (named meilleurePosition). The expected output is
[2.411247955172414, 48.726054248275865]
I tried to either extract the coordinate or replace all other unwanted characters Using extractall, or extract does not match
test=dfjson['meilleurePosition'].str.extract(pat='(\d+\.\d+)')
test2=dfjson['meilleurePosition'].str.extractall(pat='(\d+\.\d+)')
Empty DataFrame
Columns: [0]
Index: []
Using replace, or str.replace does not work
test3=dfjson["meilleurePosition"].replace(to_replace=r'[^0-9.,:]',value='',regex=True)
0 {'type': 'parcelle', 'geometry': {'type': 'Point', 'coordinates': [2.411247955172414, 48.726054248275865]}}
1 {'type': 'parcelle', 'geometry': {'type': 'Point', 'coordinates': [2.412065866666666, 48.72614911111111]}}
Even none regex type does not work
test4=dfjson['meilleurePosition'].str.replace('type','whatever')
0 NaN
1 NaN
print(test)
I have tried to find why this does not work at all.
- Column type is 'object' (which is apparently good as this is a string)
- Using inplace=True without copying the dataframe leads to similar results
Why can't I manipulate this column, is it because of the special characters in it? How can get these coordinate in the good format?
OK, after more investigation, the column contains a nested dict, that's why it is not working This answer helped me a lot python pandas use map with regular expressions I did then use the following code to create a new column with the expected coordinates
def extract_coord(meilleurepositiondict):
if isinstance(meilleurepositiondict,dict) :
return meilleurepositiondict['geometry']['coordinates']
else :
return None
dfjson['meilleurePositionclean']=dfjson['meilleurePosition'].apply(lambda x: extract_coord(x))