2

I am trying to fit this data:

[['Manufacturer: Hyundai',
  'Model: Tucson',
  'Mileage: 258000 km',
  'Registered: 07/2019'],
 ['Manufacturer: Mazda',
  'Model: 6',
  'Year: 2014',
  'Registered: 07/2019']]

to pandas DataFrame.

Not all labels are present in each record, for example some records have 'Mileage' and others don't and vice versa. I have a total of 26 features and very few items have all of them.

I would like to construct pandas DataFrame that will hold features in columns and if feature doesn't exists than content should be 'NaN'.

I have

colnames=['Manufacturer', 'Model', 'Mileage', 'Registered', 'Year'...(all 26 features here)] 
df = pd.read_csv("./data/output.csv", sep=",", names=colnames, header=None)

Few first prerequisite columns are giving expected output but when it comes to optional features than missing data causing features after that to turn out under wrong columns. Records are mapped correctly only if all features are present.

I forgot to mention that some features that are missing value also don't have ":" but are present in list. So in this 2 cases:

  • 'Mileage', (value missing, but also ':' is missing)
  • missing 'Mileage' from record altogheter

assignment for both cases should be 'NaN'.

Hrvoje
  • 13,566
  • 7
  • 90
  • 104

1 Answers1

4

Use nested list comprehension for list of dictionaries and pass to DataFrame contructor, if same key is missing is added NaN:

L = [['Manufacturer: Hyundai',
  'Model: Tucson',
  'Mileage: 258000 km',
  'Registered: 07/2019'],
 ['Manufacturer: Mazda',
  'Model: 6',
  'Year: 2014',
  'Registered: 07/2019']]

df = pd.DataFrame([dict(y.split(':') for y in x) for x in L])
print (df)
  Manufacturer     Mileage    Model Registered   Year
0      Hyundai   258000 km   Tucson    07/2019    NaN
1        Mazda         NaN        6    07/2019   2014

EDIT: You can use .split(maxsplit=1) for split by first whitespace:

L = [['Manufacturer Hyundai',
  'Model Tucson',
  'Mileage 258000 km',
  'Registered 07/2019'],
 ['Manufacturer Mazda',
  'Model 6',
  'Year 2014',
  'Registered 07/2019']]


df = pd.DataFrame([dict(y.split(maxsplit=1) for y in x) for x in L])
print (df)

  Manufacturer    Mileage   Model Registered  Year
0      Hyundai  258000 km  Tucson    07/2019   NaN
1        Mazda        NaN       6    07/2019  2014

EDIT:

L = [['Manufacturer  Hyundai',
  'Model  Tucson',
  'Mileage  258000 km',
  'Registered  07/2019'],
 ['Manufacturer  Mazda',
  'Model  6',
  'Year  2014',
  'Registered  07/2019',
  'Additional equipment aaa']]

words2 = ['Additional equipment']

L1 = []
for x in L:
    di = {}
    for y in x:
        for word in words2:
            if set(word.split(maxsplit=2)[:2]) < set(y.split()):
                i, j, k = y.split(maxsplit=2)
                di['_'.join([i, j])] = k
            else:
                i, j = y.split(maxsplit=1)
                di[i] = j
    L1.append(di)

df = pd.DataFrame(L1)
print (df)
  Additional_equipment Manufacturer    Mileage   Model Registered  Year
0                  NaN      Hyundai  258000 km  Tucson    07/2019   NaN
1                  aaa        Mazda        NaN       6    07/2019  2014
jezrael
  • 822,522
  • 95
  • 1,334
  • 1,252
  • It says : ValueError: dictionary update sequence element #18 has length 1; 2 is required . Thats first empty element in first record. – Hrvoje Nov 30 '18 at 10:38
  • My mistake. actually those elements are not just missing, they don't have ":" after feature name. – Hrvoje Nov 30 '18 at 10:39
  • Your 1st version work perfectly on records with missing features. I'm already super-thankful for that! Unfortunately updated version is not working on records where features have 2 words ('Additional equipment',) for example . – Hrvoje Nov 30 '18 at 11:15
  • 1
    Awesom! Works perfectly! Thank you sooooo much! – Hrvoje Nov 30 '18 at 14:07