python - Pandas: transform a DataFrame by splitting each value into multiple values -
i'd extract point values , frequencies of scrabble letters. wikipedia gives following table (copied http://en.wikipedia.org/wiki/scrabble_letter_distributions#english ).
# english scrabble points (rows) , frequencies (columns). scrabble_table = """ ×1 ×2 ×3 ×4 ×6 ×8 ×9 ×12 0 (blank) 1 l s u n r t o e 2 g d 3 b c m p 4 f h v w y 5 k 8 j x 10 q z """
which can import dataframe without problems.
pd.read_table(io.stringio(scrabble_table), index_col=0).fillna("")
image of resulting table.
this table has values such "l s u" in 1-point row , 4-frequency column. i'd table has 1 row per letter , 3 columns: letter, frequency, , point value. can suggest how transform original table want? thanks.
since spaces in, say, l s u
ignored, let's first remove spaces:
scrabble_table = scrabble_table.replace(' ', '')
now read table dataframe.
df = pd.read_table(stringio(scrabble_table), delimiter=',', index_col=0)
to move column level values new index level, use stack()
. since there no more column levels, stack()
returns series. reset_index()
moves index levels columns:
df = df.stack().reset_index() # level_0 level_1 0 # 0 0 ×2 (blank) # 1 1 ×4 lsu # 2 1 ×6 nrt # 3 1 ×8 o # 4 1 ×9 ai # 5 1 ×12 e # 6 2 ×3 g # 7 2 ×4 d # 8 3 ×2 bcmp # 9 4 ×2 fhvwy # 10 5 ×1 k # 11 8 ×1 jx # 12 10 ×1 qz
to place (blank)
on equal footing other tiles, let replace single character, such underscore (_
).
df = df.replace('(blank)', '_')
and while we're @ it, let's name columns meaningful names:
df.columns=['points', 'freq', 'letters']
now can form list comprehension iterates on rows using df.iterrows()
, , each row, iterate on letters form list of tuples. each tuple consists of 3 values: (row['points'], row['freq'], letter)
. passing list comprehension pd.dataframe
yields desired result:
df = pd.dataframe([(row['points'], row['freq'], letter) index, row in df.iterrows() letter in row[-1]], columns=['points', 'freq', 'letter'])
to make code easier cut-and-paste, i've replaced tabs commas:
import numpy np import pandas pd try: # python3 io import stringio except importerror: # python2 stringio import stringio scrabble_table = u'''\ ,×1 ,×2 ,×3 ,×4 ,×6 ,×8 ,×9 ,×12 0, ,(blank) 1,,,,l s u ,n r t ,o ,a ,e 2 ,,,g ,d 3 ,,b c m p,,,, 4 ,,f h v w y 5 ,k 8 ,j x 10 ,q z''' scrabble_table = scrabble_table.replace(' ', '') df = pd.read_table(stringio(scrabble_table), delimiter=',', index_col=0) df = df.stack().reset_index() df = df.replace('(blank)', '_') df.columns=['points', 'freq', 'letters'] df = pd.dataframe([(row['points'], row['freq'], letter) index, row in df.iterrows() letter in row[-1]], columns=['points', 'freq', 'letter']) print(df)
yields
points freq letter 0 0 ×2 _ 1 1 ×4 l 2 1 ×4 s 3 1 ×4 u 4 1 ×6 n 5 1 ×6 r 6 1 ×6 t 7 1 ×8 o 8 1 ×9 9 1 ×9 10 1 ×12 e 11 2 ×3 g 12 2 ×4 d 13 3 ×2 b 14 3 ×2 c 15 3 ×2 m 16 3 ×2 p 17 4 ×2 f 18 4 ×2 h 19 4 ×2 v 20 4 ×2 w 21 4 ×2 y 22 5 ×1 k 23 8 ×1 j 24 8 ×1 x 25 10 ×1 q 26 10 ×1 z
Comments
Post a Comment