Add dictionary to pandas data frame and ignore extra values

Question

I'm reading a lot of log files and generating a dictionary from each log by parsing it. I want to add these dictionaries to a dataframe, which I later use for analysis. But the information I need in the dataframe may differ each time based on user input, so I don't want every key in the dictionary to be added to the dataframe. I want only the columns I define to be added.

As of now I'm adding all the dictionaries one by one to a list, then loading this list into a dataframe.

import pandas as pd

my_dict_list = []
for log in log_lines:
    # here logic to parse the log and generate the dictionary d
    my_dict_list.append(d)
pd.DataFrame(my_dict_list)

This adds all the keys and their values to the dataframe. What I want is to define some columns: say the user asks for columns ['a','b','c'] for analysis; then the dataframe should load only these keys and their values and ignore the rest.

my_dict_list =[ {'a':'abc','b':'123','c':'hello', 'date':'20-5-2019'},
                {'a':'dfc','b':'453','c':'user', 'date':'23-5-2019'},
                {'a':'bla','b':'2313','c':'anything', 'date':'25-5-2019'} ]

Note: I don't want to drop these keys at log-extraction time, because I will be extracting a lot of logs, so it would be time-consuming.

Is there a faster way to achieve this using pandas?

I suggest you create a dataframe with all values and then remove the columns the user doesn't require, because removing key-value pairs from a list of dicts is slower than removing columns in pandas. To remove columns, see https://stackoverflow.com/questions/51167612/what-is-the-best-way-to-remove-column-in-pandas – Mohamed Thasin ah, Jul 25 '19 at 06:17
@MohamedThasinah Yes, I can do that, but is there any way to define the dataframe so that it takes only the columns we defined? That way we could read the data faster and avoid redoing work. – Murali, Jul 25 '19 at 06:21
You can't skip columns while calling pd.DataFrame(). You can pass `usecols` to `read_csv`, but for that you need your data as a file. If you need to use the same data in multiple places, I suggest you save your data to a file and use `usecols`. It will save you time. – Mohamed Thasin ah, Jul 25 '19 at 06:24
Use `reindex` along `axis=1`? i.e. `pd.DataFrame(my_dict_list).reindex(['a', 'b', 'c'], axis=1)` – Chris Adams, Jul 25 '19 at 06:39
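
The suggestions in these comments can be compared side by side. The following is a minimal sketch, not a benchmark: `wanted` is a hypothetical stand-in for the user's column selection, and the third option assumes that the `columns` argument of the `pd.DataFrame` constructor selects keys when given a list of dicts, which is worth checking against your pandas version.

```python
import pandas as pd

my_dict_list = [
    {'a': 'abc', 'b': '123', 'c': 'hello', 'date': '20-5-2019'},
    {'a': 'dfc', 'b': '453', 'c': 'user', 'date': '23-5-2019'},
    {'a': 'bla', 'b': '2313', 'c': 'anything', 'date': '25-5-2019'},
]
wanted = ['a', 'b', 'c']  # hypothetical user selection

# Option 1: build the full frame, then drop whatever was not requested.
full = pd.DataFrame(my_dict_list)
dropped = full.drop(columns=[c for c in full.columns if c not in wanted])

# Option 2: reindex along the columns. A requested key that is missing
# from every dict becomes an all-NaN column instead of raising an error.
reindexed = full.reindex(wanted, axis=1)

# Option 3 (assumption, see above): let the constructor do the selection,
# so the unwanted 'date' key never becomes a column.
direct = pd.DataFrame(my_dict_list, columns=wanted)

print(direct)
```

All three leave the list of dictionaries untouched, so the parsing loop does not need to change.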

score 0 · Answer 1 · edited Jul 30 '19 at 10:16

I am just providing some raw logic for your query. I may be wrong on some parts, but I hope you find it helpful. You can also mail me with future queries; I will be happy to help.

  import pandas as pd

  columns = []
  x = int(input('Enter the number of columns you need: '))

  for i in range(x):
      # Read each column as an integer position (0 = first column)
      print("Please specify a column position")
      columns.append(int(input()))

  my_dict_list = [ {'a':'abc','b':'123','c':'hello', 'date':'20-5-2019'},
                   {'a':'dfc','b':'453','c':'user', 'date':'23-5-2019'},
                   {'a':'bla','b':'2313','c':'anything', 'date':'25-5-2019'} ]

  # Build the full frame once, then select the requested positions
  value = pd.DataFrame(my_dict_list)
  print(value.iloc[:, columns])

edited Jul 30 '19 at 10:16 by marc_s · answered Jul 25 '19 at 06:29 by Engineering Projects

May I know why columns are in `int`? OP's example of columns is `['a','b','c']` ? – Mohamed Thasin ah Jul 25 '19 at 06:33
I am treating them as column 1, 2, or 3. If you want a, b, c, you can change it according to your requirement, brother. I am new here on StackOverflow; this is my first day, and I apologize for any mistakes. – Engineering Projects Jul 25 '19 at 06:39
My intent is not to criticise anyone. It would be great if you provided the answer appropriately; you may update your answer accordingly. – Mohamed Thasin ah Jul 25 '19 at 06:44

score 0 · Answer 2 · answered Jul 25 '19 at 07:11

On the tmp_Dict line you can filter for the requested columns and save only those.

def log_dataframe(log_lines, requested_columns):
    my_dict_list = []
    for log in log_lines:
        # here logic to parse the log and generate the dictionary d

        # keep only the requested keys from the parsed dictionary
        tmp_Dict = {requested_key : d[requested_key] for requested_key in requested_columns}
        my_dict_list.append(tmp_Dict)
    return pd.DataFrame(my_dict_list)

answered Jul 25 '19 at 07:11 by Ilker Kurtulus

I added a note: I don't want the filtering to happen at the extraction level, because I have to do this for millions of logs. I want this to happen in the dataframe, if there is a way. – Murali Jul 25 '19 at 07:57
Then you can ignore tmp_Dict and simply do `return pd.DataFrame(my_dict_list)[requested_columns]`. However, the first solution is more efficient in terms of memory. If you want a solution for the log parsing, you should share some logs so we can write code for it. – Ilker Kurtulus Jul 25 '19 at 10:06
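
The select-after-construction variant from the last comment could look roughly like the sketch below. The helper name `frame_with_columns` and the sample data are hypothetical. Skipping requested columns that never appear in the logs is an added assumption, since plain `df[requested_columns]` raises a `KeyError` when a label is missing.

```python
import pandas as pd

def frame_with_columns(dict_list, requested_columns):
    """Build the full frame, then keep only requested columns that exist."""
    df = pd.DataFrame(dict_list)
    # Keep the caller's ordering, ignore labels no dictionary contained.
    present = [col for col in requested_columns if col in df.columns]
    return df[present]

# Hypothetical parsed logs, shaped like the dictionaries in the question.
sample = [
    {'a': 'abc', 'b': '123', 'c': 'hello', 'date': '20-5-2019'},
    {'a': 'dfc', 'b': '453', 'c': 'user', 'date': '23-5-2019'},
]
result = frame_with_columns(sample, ['a', 'c', 'missing'])
print(result)
```

As the comment notes, this builds every column before discarding the unwanted ones, so it trades peak memory for a simpler parsing loop.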
