I am using PySpark with Python 3 and Spark 2.1.0, and I have a list of nested lists, such as:
lista_archivos = [
    [['FILE','123.xml'],['NAME','ANA'],['SURNAME','LÓPEZ'],
     ['BIRTHDATE','05-05-2000'],['NATIONALITY','ESP']],
    [['FILE','458.xml'],['NAME','JUAN'],['SURNAME','PÉREZ'],
     ['NATIONALITY','ESP']],
    [['FILE','789.xml'],['NAME','PEDRO'],['SURNAME','CASTRO'],
     ['BIRTHDATE','07-07-2007'],['NATIONALITY','ESP']],
]
The inner lists have different lengths. I would like to create a DataFrame from this list, where the column names are the first element of each pair (i.e. 'FILE', 'NAME', 'SURNAME', 'BIRTHDATE', 'NATIONALITY') and the data is the second element.
As you can see, the second list has no 'BIRTHDATE' entry. I need the DataFrame to include this column anyway, with NaN or a blank in that position.
The DataFrame should look like this:
FILE     NAME   SURNAME  BIRTHDATE   NATIONALITY
------------------------------------------------
123.xml  ANA    LÓPEZ    05-05-2000  ESP
458.xml  JUAN   PÉREZ    NaN         ESP
789.xml  PEDRO  CASTRO   07-07-2007  ESP
Values that share a key must end up in the same column.
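To make the target concrete, the reshaping described above could be sketched roughly as follows. This is only a minimal sketch, assuming each inner [key, value] pair can be read as a dict entry and that a missing key should become None. The names columnas and filas are hypothetical and chosen just for illustration.

```python
# The list from the question: one inner list of [key, value] pairs per file.
lista_archivos = [
    [['FILE', '123.xml'], ['NAME', 'ANA'], ['SURNAME', 'LÓPEZ'],
     ['BIRTHDATE', '05-05-2000'], ['NATIONALITY', 'ESP']],
    [['FILE', '458.xml'], ['NAME', 'JUAN'], ['SURNAME', 'PÉREZ'],
     ['NATIONALITY', 'ESP']],
    [['FILE', '789.xml'], ['NAME', 'PEDRO'], ['SURNAME', 'CASTRO'],
     ['BIRTHDATE', '07-07-2007'], ['NATIONALITY', 'ESP']],
]

# Hypothetical names (columnas, filas), not taken from any library.
# Collect the column names in order of first appearance.
columnas = []
for archivo in lista_archivos:
    for clave, _ in archivo:
        if clave not in columnas:
            columnas.append(clave)

# Build one dict per file; dict.get returns None when a key is missing.
filas = []
for archivo in lista_archivos:
    registro = dict(archivo)
    filas.append({col: registro.get(col) for col in columnas})

for fila in filas:
    print(fila)
```

With rows in this shape, pd.DataFrame(filas, columns=columnas) should produce one row per file with the missing BIRTHDATE left empty (None or NaN). spark.createDataFrame also accepts a pandas DataFrame, although I have not verified how Spark 2.1.0 treats the missing value.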
I have written this code, but the result does not look like the table I want:
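import pandas as pd

# Map each position to its inner list of [key, value] pairs
dictOfWords = {i: lista_archivos[i] for i in range(len(lista_archivos))}
# One Series per file, then transpose so that each file becomes a row
tabla = pd.DataFrame({k: pd.Series(v) for k, v in dictOfWords.items()})
tabla_final = tabla.transpose()
tabla_final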
I have also tried this:
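# Same index-to-list mapping as in the first attempt
dictOfWords = {i: lista_archivos[i] for i in range(len(lista_archivos))}
print(dictOfWords)
# Each inner [key, value] pair ends up as a single cell
tabla = pd.DataFrame.from_dict(dictOfWords, orient='index')
tabla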
The result is not what I want either.
I would like both a pandas DataFrame and a Spark DataFrame, if possible.
Thanks!!