1

I have file as follows:

I want to create separate dataframes when there is \n present. The value above the \n will be the header of that specific dataframe. I checked for a solution here but it works in different way. Any help appreciated. Thanks

file1.txt

10C.vcf

Allele
Consequence
IMPACT
SYMBOL
Gene
Feature_type
Feature
1P1.vcf

Allele
Consequence
IMPACT
SYMBOL
Gene
Feature_type
Feature
13C.vcf

Allele
Consequence
IMPACT
SYMBOL
Gene
Feature_type
Feature
40C.vcf

Allele
Consequence
IMPACT
SYMBOL
Gene
Feature_type
Feature

Expected output from the data:

df1

   10C.vcf
0  Allele
1  Consequence
2  IMPACT
3  SYMBOL
4  Gene
5  Feature_type
6  Feature

df2

   1P1.vcf
0  Allele
1  Consequence
2  IMPACT
3  SYMBOL
4  Gene
5  Feature_type
6  Feature

df3

   13C.vcf
0  Allele
1  Consequence
2  IMPACT
3  SYMBOL
4  Gene
5  Feature_type
6  Feature

df4

   40C.vcf
0  Allele
1  Consequence
2  IMPACT
3  SYMBOL
4  Gene
5  Feature_type
6  Feature

I want to create separate dataframes when there is \n present. The value above the \n will be the header of that specific dataframe. I checked for a solution here but it works in different way. Any help appreciated. Thanks

  • 1
    Can you write code that manually iterates over the lines of the file, and creates a list with each dataframe's contents? – Karl Knechtel Apr 13 '21 at 04:01
  • 1
    why dont you read the full file into a list, then split the list by `\n\n`. That will give you a list of lists. Then you can create a loop to create dataframes for each occurrence of a list item? – Joe Ferndz Apr 13 '21 at 04:09
  • I dont think you can give the dataframe name as `13C.vcf`. You will need to replace `.` with `_` – Joe Ferndz Apr 13 '21 at 04:39

2 Answers2

1

Here's how you want to do it. I have added comments in the code so you can understand what I am doing.

Approach: Create a globals variable. Then use the globals variable to define the dynamic dataframe. Then access the value as that is now your dynamic dataframe.

If you are not very clear about how this work, see this Stack Overflow solution where I am defining a dynamic dataframe

txt = '''10C.vcf

Allele
Consequence
IMPACT
SYMBOL
Gene
Feature_type
Feature
1P1.vcf

Allele
Consequence
IMPACT
SYMBOL
Gene
Feature_type
Feature
13C.vcf

Allele
Consequence
IMPACT
SYMBOL
Gene
Feature_type
Feature
40C.vcf

Allele
Consequence
IMPACT
SYMBOL
Gene
Feature_type
Feature'''

import pandas as pd
#create a globals variable to setup dynamic dataframe names
glb = globals()

#use txt = f.read() to read all the rows in the file into a variable
#then split the file at double \n (\n\n)
txt_data = txt.split('\n\n')

#the first element in the list is the name of the first dataframe
#store it into dfname
dfname = txt_data[0]

#iterate through the rest of the list 
for i,t in enumerate(txt_data[1:]):
    #since each row contains \n, use .split() to remove the \n
    t = t.split()
    #last row contains the name of the next dataframe 
    #exclude the last element to create your dataframe
    data = t[:-1]
    
    #create the dataframe name
    dfname = 'df_' + dfname.replace('.','_')
    
    #the data is just a single list of items. So i am creating a pd.Series
    #the way to create the dataframe is globals[dataframe name] = pd.Series (values)
    glb[dfname] = pd.Series(t[:-1])
    
    #now store the name for the next pd.Series
    dfname = t[-1]
    
    #iterating through the full list will create separate dataframes

#here's the example to access each of the dynamic dataframes
print (df_10C_vcf)
print (df_1P1_vcf)
print (df_13C_vcf)
print (df_40C_vcf)

The output of this will be:

df_10C_vcf:

0          Allele
1     Consequence
2          IMPACT
3          SYMBOL
4            Gene
5    Feature_type
6         Feature

df_1P1_vcf:

0          Allele
1     Consequence
2          IMPACT
3          SYMBOL
4            Gene
5    Feature_type
6         Feature

df_13C_vcf:

0          Allele
1     Consequence
2          IMPACT
3          SYMBOL
4            Gene
5    Feature_type
6         Feature

df_40C_vcf:

0          Allele
1     Consequence
2          IMPACT
3          SYMBOL
4            Gene
5    Feature_type
Joe Ferndz
  • 8,417
  • 2
  • 13
  • 33
0
import re
import pandas as pd
from io import StringIO

# 1. read your txt file as string
with open('file1.txt', 'r') as f:
    txt = f.read()

# 2. split your txt by `*.vcf`
datas = re.split(r'\n(?=[A-Za-z0-9]+\.vcf)', txt)
# Or
# datas = re.split(r'\n(?=.+\.vcf)', txt)
print(datas)

# 3. Create DF from string
for data in datas:
    print(pd.read_csv(StringIO(data), delim_whitespace=True))

Output

['10C.vcf\n\nAllele\nConsequence\nIMPACT\nSYMBOL\nGene\nFeature_type\nFeature', '1P1.vcf\n\nAllele\nConsequence\nIMPACT\nSYMBOL\nGene\nFeature_type\nFeature', '13C.vcf\n\nAllele\nConsequence\nIMPACT\nSYMBOL\nGene\nFeature_type\nFeature', '40C.vcf\n\nAllele\nConsequence\nIMPACT\nSYMBOL\nGene\nFeature_type\nFeature\n']


        10C.vcf
0        Allele
1   Consequence
2        IMPACT
3        SYMBOL
4          Gene
5  Feature_type
6       Feature
        1P1.vcf
0        Allele
1   Consequence
2        IMPACT
3        SYMBOL
4          Gene
5  Feature_type
6       Feature
        13C.vcf
0        Allele
1   Consequence
2        IMPACT
3        SYMBOL
4          Gene
5  Feature_type
6       Feature
        40C.vcf
0        Allele
1   Consequence
2        IMPACT
3        SYMBOL
4          Gene
5  Feature_type
6       Feature
Ynjxsjmh
  • 28,441
  • 6
  • 34
  • 52