I am using PySpark with Python 3 and Spark 2.1.0, and I have a list of lists of varying lengths, such as:

lista_archivos = [[['FILE','123.xml'],['NAME','ANA'],['SURNAME','LÓPEZ'],
['BIRTHDATE','05-05-2000'],['NATIONALITY','ESP']], [['FILE','458.xml'],
['NAME','JUAN'],['SURNAME','PÉREZ'],['NATIONALITY','ESP']], [['FILE','789.xml'],
['NAME','PEDRO'],['SURNAME','CASTRO'],['BIRTHDATE','07-07-2007'],['NATIONALITY','ESP']]]

The inner lists have different lengths. I would like to create a DataFrame from this list, where the column names are the first element of each pair (i.e. 'FILE', 'NAME', 'SURNAME', 'BIRTHDATE', 'NATIONALITY') and the values are the second element.

As you can see, the second list has no 'BIRTHDATE' entry. I need the DataFrame to create this column anyway, with NaN or a blank in that position.

The DataFrame should look like this:

FILE       NAME    SURNAME   BIRTHDATE    NATIONALITY
-----------------------------------------------------
123.xml    ANA     LÓPEZ     05-05-2000   ESP
458.xml    JUAN    PÉREZ     NaN          ESP
789.xml    PEDRO   CASTRO    07-07-2007   ESP

Values that belong to the same key must end up in the same column.

I wrote the following code, but the result does not look like the table I want:

import pandas as pd
dictOfWords = { i : lista_archivos[i] for i in range(0, len(lista_archivos) ) }
tabla = pd.DataFrame(dict([ (k,pd.Series(v)) for k,v in dictOfWords.items() ]))
tabla_final = tabla.transpose()
tabla_final

I have also tried this:

dictOfWords = { i : lista_archivos[i] for i in range(0, len(lista_archivos) ) }
print(dictOfWords)
tabla = pd.DataFrame.from_dict(dictOfWords, orient='index')
tabla

The result is not what I need either.

I would like both a pandas DataFrame and a Spark DataFrame, if possible.

Thanks!!

  • Possible duplicate of [Generate a dataframe from list with different length](https://stackoverflow.com/questions/49891200/generate-a-dataframe-from-list-with-different-length) – rassar Nov 19 '18 at 16:27
  • Possible duplicate of [Creating dataframe from a dictionary where entries have different lengths](https://stackoverflow.com/questions/19736080/creating-dataframe-from-a-dictionary-where-entries-have-different-lengths) – Nico Haase Nov 19 '18 at 16:34
  • Do you want a pandas DataFrame or a spark DataFrame? – pault Nov 19 '18 at 17:08

1 Answer

The following should work in your case:

In [5]: lista_archivos = [[['FILE','123.xml'],['NAME','ANA'],['SURNAME','LÓPEZ'],
   ...: ['BIRTHDATE','05-05-2000'],['NATIONALITY','ESP']], [['FILE','458.xml'],
   ...: ['NAME','JUAN'],['SURNAME','PÉREZ'],['NATIONALITY','ESP']], [['FILE','789.xml'],
   ...: ['NAME','PEDRO'],['SURNAME','CASTRO'],['BIRTHDATE','07-07-2007'],['NATIONALITY','ESP']]]

In [6]: pd.DataFrame(list(map(dict, lista_archivos)))
Out[6]:
    BIRTHDATE     FILE   NAME NATIONALITY SURNAME
0  05-05-2000  123.xml    ANA         ESP   LÓPEZ
1         NaN  458.xml   JUAN         ESP   PÉREZ
2  07-07-2007  789.xml  PEDRO         ESP  CASTRO

Essentially, you convert each sublist to a dict and pass the list of those dicts to the DataFrame constructor. The constructor handles a list of dicts naturally: keys become columns, and a key missing from a record is filled with NaN.
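If the column order from the question matters, a minimal sketch along the same lines could pass an explicit `columns=` list to the constructor. The names `columns`, `records` and `rows` below are illustrative and not part of the original question or answer:

```python
import pandas as pd

lista_archivos = [
    [['FILE', '123.xml'], ['NAME', 'ANA'], ['SURNAME', 'LÓPEZ'],
     ['BIRTHDATE', '05-05-2000'], ['NATIONALITY', 'ESP']],
    [['FILE', '458.xml'], ['NAME', 'JUAN'], ['SURNAME', 'PÉREZ'],
     ['NATIONALITY', 'ESP']],
    [['FILE', '789.xml'], ['NAME', 'PEDRO'], ['SURNAME', 'CASTRO'],
     ['BIRTHDATE', '07-07-2007'], ['NATIONALITY', 'ESP']],
]

# Column order requested in the question (illustrative name).
columns = ['FILE', 'NAME', 'SURNAME', 'BIRTHDATE', 'NATIONALITY']

# Each sublist of [key, value] pairs becomes one dict, i.e. one row.
records = [dict(pairs) for pairs in lista_archivos]

# columns= fixes the column order; keys missing from a record become NaN.
tabla = pd.DataFrame(records, columns=columns)
print(tabla)

# Plain tuples in the same column order, with None where a key is missing.
rows = [tuple(record.get(col) for col in columns) for record in records]
```

For the Spark side, assuming a `SparkSession` is available as `spark` (not run here), `spark.createDataFrame(rows, schema=columns)` should give an equivalent Spark DataFrame, with the missing birthdate stored as null rather than NaN.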

juanpa.arrivillaga