python - Pandas: transform a DataFrame by splitting each value into multiple values -


i'd extract point values , frequencies of scrabble letters. wikipedia gives following table (copied http://en.wikipedia.org/wiki/scrabble_letter_distributions#english ).

# english scrabble points (rows) , frequencies (columns). scrabble_table = """     ×1  ×2  ×3  ×4  ×6  ×8  ×9  ×12 0       (blank)                      1               l s u   n r t   o   e 2           g   d                3       b c m p                      4       f h v w y                        5   k                            8   j x                          10  q z                          """ 

which can import dataframe without problems.

pd.read_table(io.stringio(scrabble_table), index_col=0).fillna("") 

image of resulting table.

this table has values such "l s u" in 1-point row , 4-frequency column. i'd table has 1 row per letter , 3 columns: letter, frequency, , point value. can suggest how transform original table want? thanks.

since spaces in, say, l s u ignored, let's first remove spaces:

scrabble_table = scrabble_table.replace(' ', '') 

now read table dataframe.

df = pd.read_table(stringio(scrabble_table), delimiter=',', index_col=0) 

to move column level values new index level, use stack(). since there no more column levels, stack() returns series. reset_index() moves index levels columns:

df = df.stack().reset_index() #     level_0 level_1        0 # 0         0      ×2  (blank) # 1         1      ×4      lsu # 2         1      ×6      nrt # 3         1      ×8        o # 4         1      ×9       ai # 5         1     ×12        e # 6         2      ×3        g # 7         2      ×4        d # 8         3      ×2     bcmp # 9         4      ×2    fhvwy # 10        5      ×1        k # 11        8      ×1       jx # 12       10      ×1       qz 

to place (blank) on equal footing other tiles, let replace single character, such underscore (_).

df = df.replace('(blank)', '_') 

and while we're @ it, let's name columns meaningful names:

df.columns=['points', 'freq', 'letters'] 

now can form list comprehension iterates on rows using df.iterrows(), , each row, iterate on letters form list of tuples. each tuple consists of 3 values: (row['points'], row['freq'], letter). passing list comprehension pd.dataframe yields desired result:

df = pd.dataframe([(row['points'], row['freq'], letter)                     index, row in df.iterrows()                     letter in row[-1]], columns=['points', 'freq', 'letter']) 

to make code easier cut-and-paste, i've replaced tabs commas:

import numpy np import pandas pd  try:     # python3     io import stringio except importerror:     # python2     stringio import stringio   scrabble_table = u'''\ ,×1 ,×2 ,×3 ,×4 ,×6 ,×8 ,×9 ,×12 0, ,(blank) 1,,,,l s u ,n r t ,o ,a ,e 2 ,,,g ,d 3 ,,b c m p,,,, 4 ,,f h v w y 5 ,k 8 ,j x 10 ,q z''' scrabble_table = scrabble_table.replace(' ', '') df = pd.read_table(stringio(scrabble_table), delimiter=',', index_col=0) df = df.stack().reset_index() df = df.replace('(blank)', '_') df.columns=['points', 'freq', 'letters'] df = pd.dataframe([(row['points'], row['freq'], letter)                     index, row in df.iterrows()                     letter in row[-1]], columns=['points', 'freq', 'letter']) print(df) 

yields

    points freq letter 0        0   ×2      _ 1        1   ×4      l 2        1   ×4      s 3        1   ×4      u 4        1   ×6      n 5        1   ×6      r 6        1   ×6      t 7        1   ×8      o 8        1   ×9      9        1   ×9      10       1  ×12      e 11       2   ×3      g 12       2   ×4      d 13       3   ×2      b 14       3   ×2      c 15       3   ×2      m 16       3   ×2      p 17       4   ×2      f 18       4   ×2      h 19       4   ×2      v 20       4   ×2      w 21       4   ×2      y 22       5   ×1      k 23       8   ×1      j 24       8   ×1      x 25      10   ×1      q 26      10   ×1      z 

Comments

Popular posts from this blog

Java 3D LWJGL collision -

spring - SubProtocolWebSocketHandler - No handlers -

methods - python can't use function in submodule -